
rml-core's People

Contributors

anaigmo, andimou, bjdmeest, chrdebru, dachafra, dylanvanassche, elsdvlee, pmaria


rml-core's Issues

What about referencing object/term maps

The issue on joins is closed and a "referencing term maps" spec is mentioned, but where is it? The problem is that the test cases still contain joins, and some details need to be worked out.

Term Map definition in RML core

I get what you mean, it's also inherited reasoning from RML, but if we want to finally be 100% correct, we do need a superclass that allows us to make everything correct. Then we would have something like this:

  • rml:newSuperTermMap
    • rml:TermMap
    • rr:TermMap
    • fnml:FunctionMap

where rml:TermMap, rr:TermMap, and fnml:FunctionMap are subclasses of the new super term map.

That would also make everything more correct, because the Term Map is actually abused in the case of RML: an rr:TermMap should have one of column/template and not the reference that an rml:TermMap has.

RML was purposely designed to abuse the R2RML vocabulary, but as it is now, the cleanest solution would be to revise the RML vocabulary to correctly generalize R2RML.

If we make things valid, i.e. as explained above, then I think it would be best to go straight for a correct definition of fnml:FunctionMap.

(It might even be like the following, i.e. rr:TermMap as a subclass of rml:TermMap, but that might not matter for the discussion here:

  • rml:newSuperTermMap
    • rml:TermMap
      • rr:TermMap
    • fnml:FunctionMap)

Originally posted by @andimou in #11 (comment)
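
The hierarchy proposed above could be sketched in Turtle along these lines (a sketch only; rml:newSuperTermMap is a placeholder name from the discussion, not a published term):

```ttl
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rml:  <http://semweb.mmlab.be/ns/rml#> .
@prefix rr:   <http://www.w3.org/ns/r2rml#> .
@prefix fnml: <http://semweb.mmlab.be/ns/fnml#> .

rml:TermMap      rdfs:subClassOf rml:newSuperTermMap .
rr:TermMap       rdfs:subClassOf rml:newSuperTermMap .
fnml:FunctionMap rdfs:subClassOf rml:newSuperTermMap .

# Variant from the second hierarchy above: rr:TermMap nested under rml:TermMap.
# rr:TermMap rdfs:subClassOf rml:TermMap .
```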

empty literals

should we generate a triple for empty literals?

if so, when should we do that?

Modules namespace, slash vs hash

Regarding the decision of which namespace we should use for the RML ontologies (core + modules), there are three options, and they also influence the use of hash or slash:

  1. All modules with the same namespace: All concepts share the same namespace; they are divided into different files and published independently from different repositories. Hash/slash indifferent.
  2. Different namespace per module: Each module has a different namespace, including all the concepts it defines. Also divided into different files, published independently, and hash/slash indifferent. Easier to manage each module, but more difficult to remember which module each term belongs to. Example with hash and with slash.
  3. Hybrid: Same namespace for all terms, but each module has a different IRI. The terms have an rdfs:isDefinedBy property that points to the module IRI. Harder to publish with hash; it would require publishing with slash. Example.
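
For illustration, the three options might look as follows (the IRIs below are hypothetical, chosen only to show the pattern):

```
Option 1 (shared namespace):  rml:source  -> http://example.org/rml/source
                              (defined in the io module, same namespace as core terms)
Option 2 (per-module):        rmlio:source -> http://example.org/rml/io#source
Option 3 (hybrid):            rml:source  -> http://example.org/rml/source
                              plus rdfs:isDefinedBy <http://example.org/rml/io>
```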

Testcases: port from RML testcases

SHACL shapes in #61 are hard to validate because there are no test cases.
Let's port them from RML testcases to the new specification and make sure they are covering the complete specification.

Data error handling (e.g., lenient mode)

There are a couple of MUST statements in the spec that depend on the actual data; these could be revisited to support some kind of 'lenient' mode of processing the mapping document.

For example: I have a mapping file with 10 triples maps that map 10 tables, which I use for mapping databases A and B (and many others). 3 of those tables are optional in database B, which crashes the mapping engine because a logical source description MUST resolve to a data source (see e.g. the R2RML spec).

  • lenient mode: process only the 7 resolvable triples maps
  • similar for "The referenced columns of all term maps of a triples map (subject map, predicate maps, object maps, graph maps) MUST be column names that exist in the term map's logical table.": don't crash when some references are missing, but generate all the others

It's an open question whether this is functionality of the mapping engine or a feature of the language; however, I think it would be good to clarify which errors should cause the engine to exit and which should be handled gracefully.
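
A minimal sketch of what such a 'lenient' processing loop could look like (the resolver API and all names are hypothetical, not from the spec):

```python
def process_lenient(triples_maps, resolve_source):
    """Process only the triples maps whose logical source resolves.

    resolve_source(tm) returns the data source, or None if it cannot
    be resolved (e.g. an optional table that is absent in this database).
    """
    processed, skipped = [], []
    for tm in triples_maps:
        source = resolve_source(tm)
        if source is None:
            skipped.append(tm)  # lenient: warn and continue instead of crashing
        else:
            processed.append((tm, source))
    return processed, skipped

# Example: 10 triples maps, 3 of which map tables absent in database B.
available = {f"tm{i}" for i in range(7)}
maps = [f"tm{i}" for i in range(10)]
done, skipped = process_lenient(maps, lambda tm: tm if tm in available else None)
```

In strict mode the same loop would raise on the first unresolvable source; the open question is whether the spec or the engine decides which behavior applies.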

Logical source cardinality

I see that we currently define the logical source cardinality to be exactly 1,

i.e. each TriplesMap should have exactly one Logical Source.
This holds in v1 of the shapes and in the current version of the spec.

Question is: do we keep this? Should it always be 1? Should it be at least 1?

inverse predicates in RML

An aspect missing from RML is the ability to generate a triple using an inverse predicate.

Specifying the generation of triples with an inverse predicate would allow you to generate a triple in the inverse direction, i.e. object -> predicate -> subject, in the context of the subject-generating triples map.

Note: YARRRML also introduces inverse predicates.

Proposal:

  • Add a new construct rml:InversePredicateMap which is a term map that can be referenced from a rml:PredicateObjectMap using rml:inversePredicateMap.
  • Add a shortcut property rml:inversePredicate for rml:inversePredicateMap, analogous to rr:predicate.
  • Redefine rml:PredicateObjectMap such that it should have at least one rml:PredicateMap or rml:InversePredicateMap. This will allow for maximum flexibility in use.
  • Restrict the allowed term types of an rml:ObjectMap that is referenced by a rml:PredicateObjectMap which also references an rml:InversePredicateMap to IRIs and blank nodes.

Example:

<someTriplesMap>
  rml:logicalSource [] ;
  rr:subject ex:subject ;
  rr:predicateObjectMap [
    rr:predicateMap [
      rr:constant ex:parent ;
    ] ;
    rml:inversePredicateMap [
      rr:constant ex:child ;
    ] ;
    rr:object ex:object ;
  ] .

or using shortcut properties:

<someTriplesMap>
  rml:logicalSource [] ;
  rr:subject ex:subject ;
  rr:predicateObjectMap [
    rr:predicate ex:parent ;
    rml:inversePredicate ex:child;
    rr:object ex:object ;
  ] .

would generate:

ex:subject ex:parent ex:object .

ex:object ex:child ex:subject .

base IRI description in the spec

Is the description/definition of the base IRI sufficient as it comes from the R2RML spec, or do we need to further clarify certain aspects of it?

Defining window operations in RML

Issue

Currently, there is no way to define windowing semantics in RML.
Windowing is crucial when evaluating joins between different live streaming
data sources.

Furthermore, windowing could also support buffering capabilities for
aggregation functions when processing streaming data sources. For example,
calculating an average of the values over the last 5 minutes.

Requirements

According to Gedik B.,
a window's behaviour is defined by its type and its policies.

There are 2 main types of windows: tumbling and sliding windows.
An illustration of how these windows work can be found here.
Note: a session window is a special case of a tumbling window where the window
only gets dropped when the inactivity threshold is violated.

The policies control when the window evicts the tuples inside
it (eviction policy), and when it triggers the processing of the
tuples using the operator logic defined inside the window (trigger policy).

Policies are further divided into 4 categories, namely:

  1. Count-based
    • Uses the number of incoming tuples to inform when to evict/trigger.
  2. Delta-based
    • Uses a threshold of an attribute of the incoming tuples to
      inform when to evict/trigger. E.g. When the temperature value of a sensor is above 40C.
  3. Time-based
    • Uses the timestamp of the incoming tuple.
  4. Punctuation-based
    • Injects punctuations inside the incoming data stream as markers to decide
      when to evict/trigger.

Thus, we need a vocabulary to define and configure windows by
describing:

  1. Window Type
  2. Eviction policy
  3. Trigger policy

The exact semantics and combinations of the policies are further explained by
Gedik B.
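
As a rough illustration of count-based trigger and eviction policies (a sketch; the class and parameter names are made up and not tied to any RML vocabulary):

```python
from collections import deque

class CountWindow:
    """Window with count-based trigger and eviction policies.

    Fires the operator every `trigger` tuples; evicts the oldest
    tuples so that at most `evict` tuples are buffered.
    """
    def __init__(self, trigger, evict, on_trigger):
        self.trigger, self.evict, self.on_trigger = trigger, evict, on_trigger
        self.buffer = deque()
        self.seen = 0

    def push(self, item):
        self.buffer.append(item)
        self.seen += 1
        while len(self.buffer) > self.evict:   # eviction policy: cap buffer size
            self.buffer.popleft()
        if self.seen % self.trigger == 0:      # trigger policy: every Nth tuple
            self.on_trigger(list(self.buffer))

# Mirror the example below: trigger every 5th record, evict beyond 15.
fired = []
w = CountWindow(trigger=5, evict=15, on_trigger=fired.append)
for i in range(20):
    w.push(i)
```

After 20 records, the operator has fired four times, and the buffer never grows past the eviction bound of 15.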

Example

Given the following RML with a join condition:

<#TM1> 
    rml:logicalSource <#STREAM1> ;
    rml:subjectMap <#SM1> ;
    rml:predicateObjectMap [
        rml:predicateMap <#PM1> ;
        rml:objectMap [ 
            rml:parentTriplesMap <#TM2>;
            rr:joinCondition [
                rr:child "id";
                rr:parent "p_id"; 
            ];

        ];

    ]. 



<#TM2> 
    rml:logicalSource <#STREAM2> ;
    rml:subjectMap <#SM2> ;
    rml:predicateObjectMap [
        rml:predicateMap <#PM2> ;
        rml:objectMap <#OM2> ] .

Windows could be defined in the object map:

<#TM1> 
    rml:logicalSource <#STREAM1> ;
    rml:subjectMap <#SM1> ;
    rml:predicateObjectMap [
        rml:predicateMap <#PM1> ;
        rml:objectMap [
            # Define the window to be used for joining
            rml:window [ 
                # Define window types 
                rml:windowType rml:Tumbling; 

                # Define the trigger policy for the window 
                # Every 5th record will execute the join
                rml:trigger [ a rml:CountPolicy;
                    rml:countValue  5;

                ]; 

                # Define the eviction policy for the window
                # Clean up window after processing the 15th record
                rml:evict [ a rml:CountPolicy;
                    rml:countValue  15;
                ];

            ];
            rml:parentTriplesMap <#TM2>;
            rr:joinCondition [
                rr:child "id";
                rr:parent "p_id"; 
            ];
        ];
    ]. 

<#TM2> 
    rml:logicalSource <#STREAM2> ;
    rml:subjectMap <#SM2> ;
    rml:predicateObjectMap [
        rml:predicateMap <#PM2> ;
        rml:objectMap <#OM2> ] .

(Data) Errors

The Data Errors subsection of the [R2]RML core specification might need some updating, perhaps also considering revised test cases.

parent and child terminology

The R2RML specification names rr:parent and rr:child as the parent query and child query; would we keep this terminology?

Or should we drop the query part and keep them only as parent and child? I have the impression that when we talk, we typically use these terms.

Or something else? If so, what?

Transformation function over joined sources?

Given two sources (A and B) that can be joined on a field, would it be possible to apply functions to some fields of B and use the results in any part of the TriplesMap of A?

Let me put an example.
Table A:

AC1 AC2
a a1
b b1

Table B

BC1 BC2
a1 "hello a"
b1 "hello b"

Output (applying uppercase to BC2)

<http://example.org/a> ex:predicate "HELLO A"^^xsd:string
<http://example.org/b> ex:predicate "HELLO B"^^xsd:string

I don't know whether this is currently possible by declaring the join condition in the mapping rules, or whether we would have to create an ad hoc function implementation.

Issues in test cases

  • 2c --> the IDs do not exist in the source and the mapping should not generate an output
  • 4b --> the subjectMap has term type literal and the mapping should not generate an output
  • 15b --> "spanish" is considered well-formed (not valid) as it falls under the 5*8ALPHA rule (src: https://www.rfc-editor.org/rfc/rfc5646). I would not vouch for using a list of valid tags, but we may restrict the spec by explicitly referring to the first language rule: language = 2*3ALPHA ["-" extlang]
  • 19a --> assumes the base IRI is the baseIRI of the mapping, but we've said that the baseIRI is provided as an argument (and there is a proposal to assign baseIRIs to triples maps (@dachafra)).
  • 19b --> should yield an error. The nq file contains no triple for "Juan Daniel," but that triple cannot be generated. IRI-safe values are only generated for templates, not references. At least, that is the case for R2RML (it is not specified in core). "R2RML always performs percent-encoding when IRIs are generated from string templates. If IRIs need to be generated without percent-encoding, then rr:column should be used instead of rr:template, with an R2RML view that performs the string concatenation." --> implies no percent encoding when using column/reference.
  • 20b should yield an error, as http://example.com/base/path/../Danny is not an absolute IRI.
  • There are no (simple) datatype map tests?
  • 7h should not have .nq files
  • 2g should not have .nq files (missed this, as there is no 2g for CSV files)
  • 10c-JSON -> should be "\\{\\{\\{ {$.['ISO 3166']} \\}\\}\\}"
  • 21a-JSON -> contains an additional POM that is not reflected in the output. 21a CSV and MySQL do not have that POM
  • 2c -> should not have output files
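
The language-tag point for 15b can be made concrete: a regex covering only the primary rule `language = 2*3ALPHA ["-" extlang]` (with `extlang = 3ALPHA *2("-" 3ALPHA)`) accepts the usual short tags but rejects "spanish", which is merely well-formed under 5*8ALPHA. This is a sketch, not the full RFC 5646 `Language-Tag` grammar:

```python
import re

# First language production only: 2*3ALPHA ["-" extlang].
# Deliberately NOT the full RFC 5646 Language-Tag grammar
# (no script, region, variant, extension, or privateuse subtags).
PRIMARY = re.compile(r"[A-Za-z]{2,3}(-[A-Za-z]{3}(-[A-Za-z]{3}){0,2})?")

def is_primary_langtag(tag):
    return PRIMARY.fullmatch(tag) is not None
```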

identifying blank nodes without an `rr:template` or `rml:reference`

In Section 4.1 Identifying collections, it is stated:

If no rr:template or rml:reference is provided for generating blank node IDs (rr:BlankNode) or IRIs (rr:IRI), then each iteration generates a new blank node identifier for the collection or container.

I think there is a general issue that we need to decide on here, which is: do we allow the generation of blank nodes without an explicit expression (rr:template, rml:reference, or other)?

Generating a random blank node ID has an impact on how you can implement joins on those terms, because there is no way to make the ID generation repeatable. The R2RML spec does not allow blank node generation without an rr:template or rr:column, I assume for this reason.

So if we wanted to allow this, we would have to either:

  • Force processors into a way of implementing joins.
  • Have processors issue warnings when such a mapping is encountered, since correct results cannot be guaranteed.
  • ?
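
One conceivable way to keep blank node generation repeatable without an explicit template, sketched here purely as an illustration (the function and its inputs are hypothetical, not a spec proposal): derive the identifier from the term map and the iteration's reference values.

```python
import hashlib

def blank_node_id(term_map_id, values):
    """Derive a repeatable blank node label from a term map identifier
    and the reference values of the current iteration (a sketch only)."""
    # \x1f as a field separator avoids collisions between value lists.
    key = term_map_id + "\x1f" + "\x1f".join(values)
    return "_:b" + hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
```

Because the label is a pure function of its inputs, two sides of a join that see the same values produce the same blank node, which is exactly what a random ID cannot guarantee.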

Validity of template

I believe that checking the validity of templates should be included in the shapes. I'm not sure whether SPARQL's regular expressions allow for recursion, but it can be achieved by:

  • removing the unescaped curly braces from the template
  • checking whether the resulting string matches ^[^\{\}]*(?:\{[^\{\}]+\}[^\{\}]*)*$ (balanced and not nested)

This can be implemented as a SPARQL-based constraint component.
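
For example, the two steps could look like this in Python (the regex is the one given above; the escape-removal step assumes R2RML's `\{` and `\}` escape sequences):

```python
import re

# Balanced, non-nested pairs of curly braces, as in the pattern above.
BALANCED = re.compile(r"[^{}]*(?:\{[^{}]+\}[^{}]*)*")

def template_is_valid(template):
    # Step 1: drop escaped curly braces so only placeholder braces remain.
    stripped = template.replace(r"\{", "").replace(r"\}", "")
    # Step 2: the remaining braces must be balanced and not nested.
    return BALANCED.fullmatch(stripped) is not None
```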

FnO output specification?

Just to summarize the current discussions and make sure I understand: the problem is that if a function is defined to return multiple outputs, FNML is ambiguous about which output to use.
Right now, implicitly, we assume to always take the first output. At the very least, this should be explicitly stated in the specification.
HOWEVER, that still gives problems if at some point you want to use the second output of a function :).
So we need a way to specify the output of the function, and Sam suggested using the SubjectMap for this.
(If above doesn't reflect the correct reasoning, please correct me and ignore my suggestions below ;) )

I personally think that there are the following options for specifying the output of the function:

  • explicitly state that the subjectMap within a FunctionTriplesMap has quite a different definition than subjectMaps within regular TriplesMaps, namely something like "The subjectMap of a FunctionTriplesMap generates the reference to the needed output of the function". I'm not for this option, since it's basically a redefinition and also hinders the generation of provenance data in the long run
  • provide a separate OutputMap within a FunctionTriplesMap with the definition "The outputMap of a FunctionTriplesMap generates the reference to the needed output of the function". The subjectMap's definition is untouched and can be used for provenance generation, and you can specify the output of the function
  • specify this outputMap on the level of the FunctionTermMap instead of the FunctionTriplesMap (so you can, e.g., use multiple outputs of the same FunctionTriplesMap for different (regular) TermMaps)

I'm in favor of this last option; see the example below for what this entails.

# Function description #

ex:parseName a fno:Function ;
  fno:expects ( [ fno:predicate ex:inputString ] ) ;
  fno:returns ( [ fno:predicate ex:firstName ] [ fno:predicate ex:lastName ] ) .

# Mapping #

<#Person_Mapping>
    rml:logicalSource <#LogicalSource> ;                  # Specify the data source
    rr:subjectMap <#SubjectMap> ;                         # Specify the subject
    rr:predicateObjectMap <#FirstNameMapping> ;               # Specify the predicate-object-maps
    rr:predicateObjectMap <#LastNameMapping> .

<#FirstNameMapping>
    rr:predicate foaf:firstName ;                              # Specify the predicate
    rr:objectMap <#FunctionTermMapFirstName> .                         # Specify the object-map
    
<#LastNameMapping>
    rr:predicate foaf:lastName ;                              # Specify the predicate
    rr:objectMap <#FunctionTermMapLastName> .                         # Specify the object-map

<#FunctionTermMapFirstName>
    fnml:functionValue <#parseNameFunctionTriplesMap> ;
    fnml:outputValue ex:firstName .

<#FunctionTermMapLastName>
    fnml:functionValue <#parseNameFunctionTriplesMap> ;
    fnml:outputValue ex:lastName .

<#parseNameFunctionTriplesMap>
    a fnml:FunctionTriplesMap ;
    rr:predicateObjectMap [
        rr:predicate fno:executes ;                   # Execute the function
        rr:objectMap [ rr:constant ex:parseName ] # ex:parseName
    ] ;
    rr:predicateObjectMap [
        rr:predicate ex:inputString ;
        rr:objectMap [ rr:reference "name" ]          # Use as input the "name" reference
    ] .

# When given the reference "name", e.g. value "Ben De Meester", this FunctionTriplesMap will generate the following triples:
_:a # Blank node, because no SubjectMap is given
  fno:executes ex:parseName ;
  ex:inputString "Ben De Meester" .

# After execution, the following triples will be generated:
_:a # Same blank node
  ex:firstName "Ben" ;
  ex:lastName "De Meester" .

The RML namespaces do not dereference

Observation

Some namespaces that are used in RML configuration files do not dereference. It is therefore not possible to obtain an RDF representation of these vocabularies.

IRI dereference is useful, since this allows vocabularies to be pulled into any standard-compliant environment, using a simple HTTP request.

MWE 1

The following cURL request reproduces the dereference that is performed by TriplyDB, but this should be very similar to how any other standards-conforming linked data client sends such requests:

 curl -vL -H 'Accept: application/trig, application/n-quads, application/n-triples;q=0.9, text/turtle;q=0.9, application/x-turtle;q=0.9, text/rdf+n3;q=0.9, application/rdf+xml;q=0.8, text/plain;q=0.8, */*;q=0.7' 'http://semweb.mmlab.be/ns/ql#'
*   Trying 193.191.148.200:80...
* Connected to semweb.mmlab.be (193.191.148.200) port 80
> GET /ns/ql HTTP/1.1
> Host: semweb.mmlab.be
> Accept: application/trig, application/n-quads, application/n-triples;q=0.9, text/turtle;q=0.9, application/x-turtle;q=0.9, text/rdf+n3;q=0.9, application/rdf+xml;q=0.8, text/plain;q=0.8, */*;q=0.7
>
< HTTP/1.1 404 Not Found
< Server: nginx/1.14.0 (Ubuntu)
< Date: Sun, 11 Feb 2024 08:55:31 GMT
< Content-Type: text/html; charset=utf-8
< Content-Length: 12
< Connection: keep-alive
< X-Powered-By: Express
< ETag: "703595115"

Notice that the 'ql' vocabulary does not exist.

MWE 2

The following cURL request reproduces the dereference that is performed by TriplyDB, but this should be very similar to how any other standards-conforming linked data client sends such requests:

curl -vL -H 'Accept: application/trig, application/n-quads, application/n-triples;q=0.9, text/turtle;q=0.9, application/x-turtle;q=0.9, text/rdf+n3;q=0.9, application/rdf+xml;q=0.8, text/plain;q=0.8, */*;q=0.7' 'http://www.w3.org/ns/r2rml#' > aap
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying [2606:4700::6812:1613]:80...
* Connected to www.w3.org (2606:4700::6812:1613) port 80
> GET /ns/r2rml HTTP/1.1
> Host: www.w3.org
> User-Agent: curl/8.4.0
> Accept: application/trig, application/n-quads, application/n-triples;q=0.9, text/turtle;q=0.9, application/x-turtle;q=0.9, text/rdf+n3;q=0.9, application/rdf+xml;q=0.8, text/plain;q=0.8, */*;q=0.7
>
< HTTP/1.1 301 Moved Permanently
< Date: Sun, 11 Feb 2024 09:05:08 GMT
< Transfer-Encoding: chunked
< Connection: keep-alive
< Cache-Control: max-age=3600
< Expires: Sun, 11 Feb 2024 10:05:08 GMT
< Location: https://www.w3.org/ns/r2rml
< Set-Cookie: __cf_bm=jBbWIn71PDCr7f80XLmWc0dTMUnSLHJwXOt9OWTrpKc-1707642308-1-AetC7y7UHMuoI4vdIMnsELUEU6fAEyQalSKFTSyBD4x4rsb61a8khjk+oPEBlmnXo79h7d6zSAwZdHXwomNgAW4=; path=/; expires=Sun, 11-Feb-24 09:35:08 GMT; domain=.w3.org; HttpOnly; SameSite=None
< Server: cloudflare
< CF-RAY: 853b6da918b40e40-AMS
< alt-svc: h3=":443"; ma=86400
<
* Ignoring the response-body
{ [5 bytes data]
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
* Connection #0 to host www.w3.org left intact
* Clear auth, redirects to port from 80 to 443
* Issue another request to this URL: 'https://www.w3.org/ns/r2rml'
*   Trying [2606:4700::6812:1613]:443...
* Connected to www.w3.org (2606:4700::6812:1613) port 443
* schannel: disabled automatic use of client certificate
* ALPN: curl offers http/1.1
* ALPN: server accepted http/1.1
* using HTTP/1.1
> GET /ns/r2rml HTTP/1.1
> Host: www.w3.org
> User-Agent: curl/8.4.0
> Accept: application/trig, application/n-quads, application/n-triples;q=0.9, text/turtle;q=0.9, application/x-turtle;q=0.9, text/rdf+n3;q=0.9, application/rdf+xml;q=0.8, text/plain;q=0.8, */*;q=0.7
>
* schannel: remote party requests renegotiation
* schannel: renegotiating SSL/TLS connection
* schannel: SSL/TLS connection renegotiated
* schannel: failed to decrypt data, need more data
< HTTP/1.1 200 OK
< Date: Sun, 11 Feb 2024 09:05:08 GMT
< Content-Type: text/html; charset=utf-8
< Transfer-Encoding: chunked
< Connection: keep-alive
< content-location: r2rml.html
< vary: negotiate,accept,Accept-Encoding
< tcn: choice
< last-modified: Mon, 17 Sep 2012 15:21:58 GMT
< etag: W/"1818d-4c9e7559fb180;a0-4939a0734f380
< Cache-Control: max-age=21600
< expires: Sun, 11 Feb 2024 15:05:08 GMT
< x-backend: www-mirrors
< x-request-id: 853b6da9bc5d6572
< strict-transport-security: max-age=15552000; includeSubdomains; preload
< content-security-policy: frame-ancestors 'self' https://cms.w3.org/; upgrade-insecure-requests
< CF-Cache-Status: BYPASS
< Set-Cookie: __cf_bm=7cDNNh9LZo1Y8n8aNwGfBd8DVwH3YTA9o5ef8kMnAXs-1707642308-1-ATwrFtMqqk6DhOQy6Oc0oBj4wSERNIUTH6h+x4xfuHHtbtg52f2QT6kkJ8dRknK0TXoyaik1f7/Vg6N87o1hMaE=; path=/; expires=Sun, 11-Feb-24 09:35:08 GMT; domain=.w3.org; HttpOnly; Secure; SameSite=None
< Server: cloudflare
< CF-RAY: 853b6da9bc5d6572-AMS
< alt-svc: h3=":443"; ma=86400
<
{ [11336 bytes data]
100 98701    0 98701    0     0   259k      0 --:--:-- --:--:-- --:--:--     0
* Connection #1 to host www.w3.org left intact

Notice that the 'rr'/'r2rml' vocabulary exists, but is not available in an RDF serialization format.

MWE 3

 curl -vL -H 'Accept: application/trig, application/n-quads, application/n-triples;q=0.9, text/turtle;q=0.9, application/x-turtle;q=0.9, text/rdf+n3;q=0.9, application/rdf+xml;q=0.8, text/plain;q=0.8, */*;q=0.7' 'http://www.w3.org/ns/csvw#' > aap
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying [2606:4700::6812:1713]:80...
* Connected to www.w3.org (2606:4700::6812:1713) port 80
> GET /ns/csvw HTTP/1.1
> Host: www.w3.org
> User-Agent: curl/8.4.0
> Accept: application/trig, application/n-quads, application/n-triples;q=0.9, text/turtle;q=0.9, application/x-turtle;q=0.9, text/rdf+n3;q=0.9, application/rdf+xml;q=0.8, text/plain;q=0.8, */*;q=0.7
>
< HTTP/1.1 301 Moved Permanently
< Date: Sun, 11 Feb 2024 09:32:16 GMT
< Transfer-Encoding: chunked
< Connection: keep-alive
< Cache-Control: max-age=3600
< Expires: Sun, 11 Feb 2024 10:32:16 GMT
< Location: https://www.w3.org/ns/csvw
< Set-Cookie: __cf_bm=iOe70axn1ua.4ohv_Y.cH9yRby0WSFMGAjmzgJyrYKU-1707643936-1-AcNjXt1N40OIS6F0aOVweENzSjT8ag0qSDRRNJusBhq5DHAXg0rRJGOInPYLu45zM7SjwJI50Kqq6cuhwZXN2P8=; path=/; expires=Sun, 11-Feb-24 10:02:16 GMT; domain=.w3.org; HttpOnly; SameSite=None
< Server: cloudflare
< CF-RAY: 853b956df9e466f7-AMS
< alt-svc: h3=":443"; ma=86400
<
* Ignoring the response-body
{ [5 bytes data]
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
* Connection #0 to host www.w3.org left intact
* Clear auth, redirects to port from 80 to 443
* Issue another request to this URL: 'https://www.w3.org/ns/csvw'
*   Trying [2606:4700::6812:1713]:443...
* Connected to www.w3.org (2606:4700::6812:1713) port 443
* schannel: disabled automatic use of client certificate
* ALPN: curl offers http/1.1
* ALPN: server accepted http/1.1
* using HTTP/1.1
> GET /ns/csvw HTTP/1.1
> Host: www.w3.org
> User-Agent: curl/8.4.0
> Accept: application/trig, application/n-quads, application/n-triples;q=0.9, text/turtle;q=0.9, application/x-turtle;q=0.9, text/rdf+n3;q=0.9, application/rdf+xml;q=0.8, text/plain;q=0.8, */*;q=0.7
>
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* schannel: remote party requests renegotiation
* schannel: renegotiating SSL/TLS connection
* schannel: SSL/TLS connection renegotiated
* schannel: failed to decrypt data, need more data
< HTTP/1.1 200 OK
< Date: Sun, 11 Feb 2024 09:32:17 GMT
< Content-Type: text/html; charset=utf-8
< Transfer-Encoding: chunked
< Connection: keep-alive
< content-location: csvw.html
< vary: negotiate,accept,Accept-Encoding
< tcn: choice
< last-modified: Mon, 08 Oct 2018 10:13:20 GMT
< etag: W/"18ee4-577b4ded6f000;9f-50ab8a8466840
< Cache-Control: max-age=21600
< expires: Sun, 11 Feb 2024 15:32:17 GMT
< x-backend: www-mirrors
< x-request-id: 853b956eaa726621
< strict-transport-security: max-age=15552000; includeSubdomains; preload
< content-security-policy: frame-ancestors 'self' https://cms.w3.org/; upgrade-insecure-requests
< CF-Cache-Status: BYPASS
< Set-Cookie: __cf_bm=WEw9_l4uyxFUnuconnVFa9rHaXt6lr61F5IPvGJKtxg-1707643937-1-AUsPiDkr+ZoFfbXbGRap+pX5GEIGjZcJoC6bNoGr5ornXc+l7FchvMzFwb74Iu1lN0rmoUa9v5Fl9ZY3vGIO/pE=; path=/; expires=Sun, 11-Feb-24 10:02:17 GMT; domain=.w3.org; HttpOnly; Secure; SameSite=None
< Server: cloudflare
< CF-RAY: 853b956eaa726621-AMS
< alt-svc: h3=":443"; ma=86400
<
{ [25027 bytes data]
100   99k    0   99k    0     0   210k      0 --:--:-- --:--:-- --:--:--  727k
* Connection #1 to host www.w3.org left intact
*

Notice that 'csvw' only exists in HTML (but not in any RDF format).

Expected

All vocabularies that are commonly used in RML configuration files should be available through IRI dereferencing.

logical source reference from the core part of the spec

The core description has references to some input data, but if we split the logical source and core parts of the spec, how should we handle the references of the examples in the core part?

Should we introduce an abstract running example, or should we still have examples for all exemplary data sources?

Join specification when logical source is the same

Let's say we have two triples maps that refer to the same logical source (and by same, we really mean the same URI, not "same because the descriptions lead to the semantically same logical source").

Sample source (CSV)

id,parent_id
1,2
2,1

Base mapping (YARRRML)

prefixes:
  ex: http://example.com#
sources:
  test: [data.csv]
mappings:
  test1:
    s: ex:$(id)
    po:
      p: ex:parent
      o:
        mapping: test2
  test2:
    s: ex:$(parent_id)

We have the following use cases that are underspecified in the spec.

The spec currently says: "If the logical source of the child triples map and the logical source of the parent triples map of a referencing object map are not identical, then the referencing object map must have at least one join condition."

  1. If a join condition is specified AND the logical source is not the same: common case, execute the join condition between each iteration pair.
  2. If a join condition is specified AND the logical source is the same: same as above.
  3. If no join condition is specified AND the logical source is not the same: do a full join (i.e., take all iterations into account).
    • example output: ex:1 ex:parent ex:2, ex:1 ex:parent ex:1, ex:2 ex:parent ex:2, ex:2 ex:parent ex:1
  4. If no join condition is specified AND the logical source is the same: don't do a full join, but take only the current iteration into account.
    • example output: ex:1 ex:parent ex:2, ex:2 ex:parent ex:1
    • this last one is the edge case, but it allows a 'join per iteration'. Question is: should we make this edge case explicit, or should there be a different way to tackle it?
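
Using the sample CSV above, the difference between the two no-join-condition cases can be sketched as follows (`rows` stands in for the iterations of the logical source; this is an illustration, not engine code):

```python
# Iterations of data.csv from the example above.
rows = [{"id": "1", "parent_id": "2"},
        {"id": "2", "parent_id": "1"}]

# No join condition, different logical sources: full cross join
# over all child/parent iteration pairs.
full_join = [(c["id"], p["parent_id"]) for c in rows for p in rows]

# No join condition, same logical source: 'join per iteration',
# pairing each iteration only with itself.
per_iteration = [(r["id"], r["parent_id"]) for r in rows]
```

The pairs correspond to the example outputs: the full join yields all four `ex:parent` triples, while the per-iteration join yields only the two.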

FnO - Does a Function Triples Map need a Logical Source?

The current use case for a LogicalSource definition on a FunctionTriplesMap seems to be:

The ability to generate values from a different source and use these values as the result of a Function Term Map.

An example of this is included in one of the proposed FnO test cases: RMLFNOTC009

However, since a FunctionTriplesMap doesn't generate values directly, but generates intermediate function execution triples expressed in FnO, the question of how to handle joins between a TriplesMap and a FunctionTriplesMap with a different LogicalSource arises.

As this is not the same type of join as a join on a RefObjectMap, this join would have to be defined. Subsequently, engines would have to implement another specific type of join.

At the same time, we have a very similar mapping challenge for generating literal values by joining different logical sources: the join-on-literal challenge.

I believe it would be advantageous to come up with a solution that covers both generating literals from different LogicalSources using joins and generating function values from different LogicalSources.
As this solution would not be specific to functions, I think we should look for a solution in the definition of LogicalSources. (pinging @thomas-delva)

Add section for implementation considerations

There are several considerations that can be made when implementing RML+FnO. Some of these might need to be specified in their own section; others might be too specific to certain implementations, but might still be interesting to mention. For now we can add a section 'Implementation considerations' to collect these, as discussed in this slack thread.

Examples of issues:

  • Handling programming language specific datatypes
  • Special handling of RDF term types in the used software library (e.g. RDF/JS, Jena, etc.)
    • e.g. what if a function returns an RDF literal with language tag?
  • What should an engine do with the generated FnO execution triples (from the function triples map)?

Mistakes in the shapes

  • rml:cartessianProduct should be rml:cartesianProduct (2 occurrences)
  • The message for RMLLanguageMapShape seems incorrect: "rml:LanguageMap must specify an rml:template or rml:constant with the IRI of the language." --> Languages do not have an IRI.

CI/CD workflows for spec repositories

Proposal sketched up with @DylanVanAssche for CI/CD workflows for the spec repositories, to keep specs and ontology/shapes in line, and also do some automatic validations / generation where applicable.

shapes coverage test

Create a GitHub action, to be used by all spec repos, that contains generic CI/CD functionality.
As input to the action, 0 or more other spec repositories can be specified on which this spec repo depends.

The action then collects the necessary artefacts from the repos and executes the following steps:

  1. Syntax-validate/parse the ontologies (if an error is encountered, give a clear message with the offending repo and file)
  2. Validate / Test shapes
  3. Validate examples

All spec repos get the following directory structure:

[spec-repo]
├── model/
│   ├── ontology/       # Ontology reflecting the spec ( this will be combined with rml-core and possibly other specs )
│   └── shapes/         # Shapes reflecting the spec
│       └── tests/      # Test cases which cover the shapes
└── examples/           # Examples used in the spec

Under model/shapes/tests we have test cases that cover the shapes reflecting the spec developed in the current repo.
This helps us ensure that our shapes and ontology are valid and stay up to date with the spec.

Under examples/ we have all examples that are used in the text of the spec. We place them in this standard location
so that they can also be validated against the shapes used in the previous step.

On each push to any branch, these tests will be run.

PR templates

We will introduce PR templates including a checklist which reminds us to:

  • make sure that the model is in sync with the spec.

Generate combined model

We can introduce a merge action which triggers a commit or PR to update the full model (in rml-core repo?)

Templates, safe separators, and null-values

TL;DR

R2RML defines a safe separator that should be used when a template contains more than one value/column reference. This issue discusses the use case for this, how null values in value references in templates should be handled, and tries to determine how current (R2)RML processors handle this.

Description

In the definition of templates in R2RML the following is stated:

If a template contains multiple pairs of unescaped curly braces, then any pair SHOULD be separated from the next one by a safe separator. This is any character or string that does not occur anywhere in any of the data values of either referenced column; or in the IRI-safe versions of the data values, if the term type is rr:IRI (see note below).

A few things of note here:

  • First off, the usage of the keyword SHOULD instead of MUST. So engines can deviate from this for valid reasons in particular circumstances? What are those reasons?

  • So a template like "http://example.com/{}{}" SHOULD actually not be used/allowed.

  • How does one know upfront what "character or string that does not occur anywhere in any of the data values of either referenced column; or in the IRI-safe versions of the data values"? Or should this be validated upon evaluating the template?

  • This doesn't seem to be covered in the R2RML test cases, other than that all templates with multiple pairs of unescaped curly braces do seem to have a separator.

  • What happens when one of the references in a template with multiple pairs of unescaped curly braces is NULL? The specified algorithm is actually not clear on this:

    1. Let result be the template string
    2. For each pair of unescaped curly braces in result:
        1. Let value be the data value of the column whose name is enclosed in the curly braces
        2. If value is NULL, then return NULL
        3. Let value be the natural RDF lexical form corresponding to value
        4. If the term type is rr:IRI, then replace the pair of curly braces with an IRI-safe version of value; otherwise, replace the pair of curly braces with value
    3. Return result
    

    Does returning null at 2.2 mean returning null for evaluation of the entire template? Or just for that pair of curly braces? And if the latter, why mention this at all in the algorithm?

    However if the former, why would there be a need for a safe separator?
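Under the "former" interpretation (a NULL reference makes the whole template evaluate to NULL), the quoted algorithm can be sketched as follows. This is an illustrative sketch, not normative: the natural RDF lexical form is simplified to `str()`, and Python's `urllib.parse.quote` stands in for the IRI-safe percent-encoding.

```python
import re
from urllib.parse import quote

class _Null(Exception):
    """Raised when a referenced value is NULL (step 2.2)."""

def expand_template(template, row, term_type="IRI"):
    """Expand an R2RML/RML template against a row (dict of reference -> value)."""
    def substitute(match):
        value = row.get(match.group(1))
        if value is None:
            raise _Null()                     # step 2.2: whole template -> NULL
        value = str(value)                    # step 2.3, simplified
        # step 2.4: IRI-safe version of the value for term type IRI
        return quote(value, safe="") if term_type == "IRI" else value
    try:
        return re.sub(r"\{([^{}]+)\}", substitute, template)  # step 2
    except _Null:
        return None                           # NULL for the entire template

# A NULL reference yields NULL for the whole term:
assert expand_template("http://example.com/{A}{B}", {"A": "A~", "B": None}) is None
# Otherwise each pair of braces is replaced by the (IRI-safe) value:
assert expand_template("http://example.com/{A}-{B}", {"A": "A", "B": "~"}) \
    == "http://example.com/A-~"
```

Under this reading, a NULL reference can never contribute to a generated term, which is exactly why the need for a safe separator becomes questionable.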

The only use case I can imagine for a safe separator is to not get clashes when a referenced value in a template with multiple referenced values is empty.

Given:

A  | B
---|-----
A  | ~
A~ | NULL

with a template without a safe separator, "http://example.com/{A}{B}", and not returning null for the template when one of its referenced values is null, this would result in

  • http://example.com/A~
  • http://example.com/A~

— an (undesired?) clash.

With a template with a safe separator, "http://example.com/{A}-{B}", and not returning null for the template when one of its referenced values is null, this would result in

  • http://example.com/A-~
  • http://example.com/A~-

— no clash.

Of course with or without a safe separator, if a template should be evaluated to null if one of its referenced values is null, in both cases the result for "http://example.com/{A}{B}" would be:

  • http://example.com/A~

However, safe separators don't actually solve clashes when the same value reference can contain both empty strings and NULLs. Given:

A  | B
---|--------
A  | ~
A~ | NULL
A~ | (empty)

In the first case you'd get all clashes:

  • http://example.com/A~
  • http://example.com/A~
  • http://example.com/A~

In the second case you'd get one clash:

  • http://example.com/A-~
  • http://example.com/A~-
  • http://example.com/A~-

In the third case (although it probably depends on the processor) you'd get one clash:

  • http://example.com/A~
  • http://example.com/A~
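The clash analysis above can be checked mechanically. A minimal sketch, assuming the lenient interpretation in which a NULL reference is treated as an empty string rather than making the whole template NULL:

```python
import re

def expand_lenient(template, row):
    """Expand a template, substituting '' for NULL references (illustrative only)."""
    return re.sub(r"\{([^{}]+)\}", lambda m: row[m.group(1)] or "", template)

rows = [
    {"A": "A",  "B": "~"},   # row 1
    {"A": "A~", "B": None},  # row 2: B is NULL
    {"A": "A~", "B": ""},    # row 3: B is the empty string
]

# Without a safe separator, all three rows clash:
no_sep = {expand_lenient("http://example.com/{A}{B}", r) for r in rows}
assert no_sep == {"http://example.com/A~"}

# With the safe separator "-", rows 2 and 3 still clash with each other:
with_sep = [expand_lenient("http://example.com/{A}-{B}", r) for r in rows]
assert with_sep[0] == "http://example.com/A-~"
assert with_sep[1] == with_sep[2] == "http://example.com/A~-"
```

This confirms the point made above: a safe separator distinguishes NULL/empty values across different references, but cannot distinguish a NULL from an empty string within the same reference.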

Questions

  1. Is there a use case to generate a value for a template when one of its referenced values is null?

    a. If yes, what happens if all referenced values are null?

  2. Do we want to maintain safe separators?

  3. How do current (R2)RML processors handle this? (Since this is not covered in the R2RML test cases)

  4. Is there another use case for safe separators that I'm missing?

Overview template implementation in current processors

Processor    | Requires safe separator in templates | Returns value for templates with one or more null references
-------------|--------------------------------------|--------------------------------------------------------------
CARML        | NO                                   | NO
RMLMapper    | NO                                   | NO
Morph-KGC    | NO                                   | NO
SDM-RDFizer  | NO                                   | NO
Ontop        | NO                                   | NO
R2RML-F      | NO                                   | NO

Specify rr:parent as a shortcut reference property of rr:parentMap (and similar for rr:child as shortcut reference property for rr:childMap)

The RMLMapper supports rml:parentTermMap: https://github.com/RMLio/rmlmapper-java/search?q=parentTermMap

:om_5 a rr:ObjectMap;
    rr:parent "friendID".

EQUALS

:om_5 a rr:ObjectMap;
    rml:parentTermMap :ptm_0.
:ptm_0 rml:reference "friendID".

This allows us to support joining on, e.g., constants (which is interesting when you want to join on IDs and, e.g., filter on constant values in the join condition), templates, or function values.
It's a subtle extension, but it greatly increases the complexity of the join operator.
Even if joining in RML gets a revamp, this is mostly a suggestion to allow term maps instead of only references.

function map definition

@andimou 's comment (putting it here so the discussion doesn't get lost):

I think when we defined this back in the days we had in mind a superclass of Term Map and Function Map, as it might be a bit incorrect to say that the Function Map is a type of Term Map: a Term Map expects at least one of reference/template/constant, or none of them but a term type specifying it is a blank node, whereas the Function Map, as you also mention, expects a triples map.

In principle it's something like, an RDF term is generated by

  • a Term Map, which may be
    • reference-based
    • constant-based
    • template-based
    • none of the above but with term type blank node
  • a Function Map

    i.e. a Function Map is an alternative to what nowadays is a Term Map.

(or do I not remember this correctly?)

subject is IRI?

The R2RML spec says

The subjects often are IRIs that are generated from the primary key column(s) of the table.

does this "often" refer to the IRIs or to the primary key column(s)?

I would think that this would translate to

The subjects are IRIs that are often generated from the primary key column(s) of the table.

or could they be anything else as well?

Ability to generate URI terms next to IRI terms

In a recent discussion in kg-construct/rml-questions#28 it was suggested that it might be useful to support generation of URIs next to IRIs, to facilitate Linked Data dereferencing. As HTTP only supports URIs, implementing IRI dereferencing inevitably requires IRI-to-URI mapping below the surface (for example, how DBpedia did it). It might therefore be valuable to also be able to generate URIs.

Note by @DylanVanAssche on this:

Maybe we need to have both rr:IRI and rr:URI as rr:termType values in the new spec?
This way, mapping rules can be explicit about this.

Is this indeed a useful enough feature to add to RML?

no section of default mapping generator in RML-core

The R2RML spec includes the following paragraph for duplicates handling:

Duplicate row preservation: For tables without a primary key, the Direct Graph requires that a fresh blank node is created for each row. This ensures that duplicate rows in such tables are preserved. This requirement is relaxed for R2RML default mappings: They MAY reuse the same blank node for multiple duplicate rows. This behaviour does not preserve duplicate rows. R2RML default mapping generators that provide default mappings based on the Direct Graph MUST document whether the generated default mapping preserves duplicate rows or not.

This needs to be adjusted in the case of heterogeneous data.

Extending usage of GraphMap at PredicateMap and ObjectMap level

Issue

Currently in the proposed spec, rml:graphMap is only usable under PredicateObjectMap and SubjectMap.

The current spec does not allow fine-tuning where you want to put the named graphs at the predicate/object level without considering things at the PredicateObjectMap level.
In my opinion, enabling usage of rml:graphMap at all term map levels (SubjectMap, PredicateMap, and ObjectMap) will give the user more control to decide, at the term level, which graph the output will belong to.
It will also reduce redundancy when writing RML documents by keeping closely related POMs in one place instead of spread out all over the document.

Example

Suppose you want all triples with predicate <predicate AB> to be in the named graphs <graph A> and <graph B>.
But, you want only <predicate C> in named graph <graph C> with the same object <object>.

<subject> <predicateAB> <object> <graphA>.
<subject> <predicateAB> <object> <graphB>.
<subject> <predicateC> <object> <graphC>.

Current RML mapping to achieve this:

  rml:subjectMap [
    rml:template "subject"
  ];

  rml:predicateObjectMap [
    rml:predicateMap [
      rml:constant ex:predicateAB
    ];
    rml:graph  ex:graphA;
    rml:graph  ex:graphB;
    rml:object "object"
  ];

  rml:predicateObjectMap [
    rml:predicateMap [
      rml:constant ex:predicateC
    ];
    rml:graph  ex:graphC;
    rml:object "object"
  ].

If GraphMaps could be defined at PredicateMap level, the following would be possible:

  rml:subjectMap [
    rml:template "subject"
  ];

  rml:predicateObjectMap [
    rml:predicateMap [
      rml:constant ex:predicateAB;
      rml:graph    ex:graphA;
      rml:graph    ex:graphB
    ];
    rml:predicateMap [
      rml:constant ex:predicateC;
      rml:graph    ex:graphC
    ];
    rml:object "object"
  ].

This also better aligns with FnO where FunctionMaps could be applied at every TermMap level (SubjectMap, PredicateMap, and ObjectMap)

Desired solution

[image: diagram of the proposed model]

rr:template: also provide an URI-unsafe alternative?

Right now, term maps with rml:template and term type rml:IRI always perform percent-encoding, and the spec notes that non-encoded values can only be achieved using rml:reference, which thus requires preprocessing. Alternatively, an FnO string concatenation function could be used.

However, this turns out to be a very common case, e.g. creating a mailto: resource from an email address: you just want to be able to do rml:template "mailto:{mailadres}", but that doesn't work because the @ will be percent-encoded (see RMLio/rmlmapper-java#219 (comment) for a recent similar request).

Does it make sense to introduce an rml:templateUnsafe predicate or similar, as an easy-to-use construct for handling this common request? It would do the same as rml:template, but never apply the IRI-safe transformation to the referenced values.
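To make the difference concrete, here is a sketch contrasting the current rml:template behavior with the proposed unsafe variant. The helper and its iri_safe flag are made up for illustration; {mailadres} is the column reference from the example above.

```python
import re
from urllib.parse import quote

def expand(template, row, iri_safe=True):
    """Expand a template; iri_safe=True mimics rml:template's percent-encoding."""
    def sub(m):
        value = str(row[m.group(1)])
        return quote(value, safe="") if iri_safe else value
    return re.sub(r"\{([^{}]+)\}", sub, template)

row = {"mailadres": "alice@example.org"}

# rml:template percent-encodes the '@', producing an unusable mailto: IRI:
assert expand("mailto:{mailadres}", row) == "mailto:alice%40example.org"

# A hypothetical rml:templateUnsafe would leave the referenced value as-is:
assert expand("mailto:{mailadres}", row, iri_safe=False) == "mailto:alice@example.org"
```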

predicate could be a blank node?

The RML spec states the following:

Both predicate maps and object maps are term maps.

But a term map can be a blank node.

Would this be "misleading"?
Does this mean that a predicate could be a blank node?
Obviously, this is not correct: a predicate should only be an IRI.


Separate sections for ExpressionMap and TermMaps in spec

Currently in the spec, the Expression Map is introduced as a subsection of the Term Maps section. This is strange, since it is broader than a term map and is also used as a superclass by non-term-map constructs.

Furthermore, the description of Term Maps and Expression Map is combined in sections like "Reference-valued Expression Maps and term types".
The spec also currently states things like:

A reference-valued Expression Map generates an RDF Term which is by default a Literal.

This has deviated significantly from how it was described at first, and is not as originally intended. The idea was to introduce expression maps not as generating RDF terms, but simply as generating values over an input source using an expression.
This to allow it to be used for more things than only generating terms. For example join conditions.

I think the current mixing of Term Maps and expression map makes the spec difficult to follow. Also, it makes it harder for the other documents that need to reference sections about Expression Map aspects without Term Map aspects to do so.

I would prefer to see the Expression Map first introduced in a separate section, as abstract functions on the input source, followed by a section about term maps which then concretely describes how RDF terms are generated from those values.

Some bad hyperlinks in the spec

For instance, in Section 2.2 Mapping Graphs and the RML Vocabulary, there are a bunch of links that point to the spec as a whole rather than to a specific section. The targeted anchors (e.g., Logical Source) cannot be found in the spec either.

Joins in RML

How joins are currently handled:

We refer to a Parent Triples Map but in fact we use the Subject Map of the Triples Map.

<#TM1> 
    rml:logicalSource <LS1> ;
    rml:subjectMap <#SM1> ;
    rml:predicateObjectMap [
        rml:predicateMap <#PM1> ;
        rml:objectMap [ rml:parentTriplesMap <#TM2> ] ]. 

<#TM2> 
    rml:logicalSource <LS2> ;
    rml:subjectMap <#SM2> ;
    rml:predicateObjectMap [
        rml:predicateMap <#PM2> ;
        rml:objectMap <#OM2> ] .

Limitations:

This limits us because a Subject Map can only be an IRI or a Blank Node but it cannot be a literal.

It also limits us as we can only join the object of the one triple with the subject of another triple.

Solution:
Actually refer to the RDF term which we want to reuse instead of the Triples Map.

In this case, the above would become:

<#TM1> 
    rml:logicalSource <LS1> ;
    rml:subjectMap <#SM1> ;
    rml:predicateObjectMap [
        rml:predicateMap <#PM1> ;
        rml:objectMap [ rml:parentTermMap <#SM2> ] ]. 

<#TM2> 
    rml:logicalSource <LS2> ;
    rml:subjectMap <#SM2> ;
    rml:predicateObjectMap [
        rml:predicateMap <#PM2> ;
        rml:objectMap <#OM2> ] .

(rml:parentTermMap would be introduced to refer to the Term we want to reuse)

We can reuse the above to even refer to the object of <#TM2>, e.g.:

<#TM1> 
    rml:logicalSource <LS1> ;
    rml:subjectMap <#SM1> ;
    rml:predicateObjectMap [
        rml:predicateMap <#PM1> ;
        rml:objectMap [ rml:parentTermMap <#OM2> ] ]. 

<#TM2> 
    rml:logicalSource <LS2> ;
    rml:subjectMap <#SM2> ;
    rml:predicateObjectMap [
        rml:predicateMap <#PM2> ;
        rml:objectMap <#OM2> ] .

and we can do that in any Term Map, e.g., in the following we join the Subject Map of the one TM with the Subject Map of the other TM:

<#TM1> 
    rml:logicalSource <LS1> ;
    rml:subjectMap [ rml:parentTermMap <#SM2> ] .

<#TM2> 
    rml:logicalSource <LS2> ;
    rml:subjectMap <#SM2> .

That proposal would solve the issues with joins for literals but not the cases where we want to use data from two data sources as in kg-construct/rml-jc#3.
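As a rough illustration of the proposed semantics (not part of the spec): the object of TM1 is produced by evaluating the referenced term map (<#SM2>) over the rows of the parent logical source, joined with the child rows. All source data, templates, and the join condition below are made-up assumptions.

```python
import re

def eval_term_map(template, row):
    """Evaluate a (template-only) term map against a row; illustrative sketch."""
    return re.sub(r"\{([^{}]+)\}", lambda m: str(row[m.group(1)]), template)

ls1 = [{"id": 1, "friendID": 7}]            # <LS1>
ls2 = [{"person": 7, "name": "Alice"}]      # <LS2>

sm1 = "http://example.com/person/{id}"      # <#SM1>
sm2 = "http://example.com/person/{person}"  # <#SM2>, reused via rml:parentTermMap

triples = [
    (eval_term_map(sm1, child), "http://example.com/knows", eval_term_map(sm2, parent))
    for child in ls1
    for parent in ls2
    if child["friendID"] == parent["person"]  # assumed join condition
]

assert triples == [("http://example.com/person/1",
                    "http://example.com/knows",
                    "http://example.com/person/7")]
```

Because the referenced construct is a term map rather than a triples map's subject map, the same mechanism would work for object maps and literal-valued term maps as well.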
