
rml-core's People

Contributors

anaigmo, andimou, bjdmeest, chrdebru, dachafra, dylanvanassche, elsdvlee, pmaria


rml-core's Issues

What about referencing object/term maps

The issue on joins is closed and a "referencing term maps" spec is mentioned, but where is it? The problem is that the test cases still contain joins, and some details need to be worked out.

Term Map definition in RML core

I get what you mean, it's also inherited reasoning from RML, but if we want to finally be 100% correct, we do need a superclass that allows us to make everything correct. Then we would have something like this:

  • rml:newSuperTermMap
    • rml:TermMap
    • rr:TermMap
    • fnml:FunctionMap

where rml:TermMap, rr:TermMap, and fnml:FunctionMap are subclasses of the new super term map.

That would also make everything more correct, because the Term Map is actually abused in the case of RML: an rr:TermMap should have one of column/template and not the reference that an rml:TermMap has.

RML was purposely designed to abuse the R2RML vocabulary, but as it is now, the cleanest solution would be to revise the RML vocabulary to correctly generalize R2RML.

If we make things valid, i.e. as explained above, then I think it would be best to go straight for a correct definition of fnml:FunctionMap.

(It might even be like the following, i.e. rr:TermMap as a subclass of rml:TermMap, but that might not matter for the discussion here:

  • rml:newSuperTermMap
    • rml:TermMap
      • rr:TermMap
    • fnml:FunctionMap)

Originally posted by @andimou in #11 (comment)
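
The hierarchy proposed above could be sketched in Turtle along these lines (a sketch only; rml:newSuperTermMap is a placeholder name from the discussion, not a published term):

```ttl
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rml:  <http://semweb.mmlab.be/ns/rml#> .
@prefix rr:   <http://www.w3.org/ns/r2rml#> .
@prefix fnml: <http://semweb.mmlab.be/ns/fnml#> .

rml:TermMap      rdfs:subClassOf rml:newSuperTermMap .
rr:TermMap       rdfs:subClassOf rml:newSuperTermMap .
fnml:FunctionMap rdfs:subClassOf rml:newSuperTermMap .

# Variant from the second hierarchy above: rr:TermMap nested under rml:TermMap.
# rr:TermMap rdfs:subClassOf rml:TermMap .
```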

empty literals

should we generate a triple for empty literals?

if so, when should we do that?

Modules namespace, slash vs hash

Regarding the decision of which namespace we should use for the RML ontologies (core + modules), there are three options, and they also influence the use of hash or slash:

  1. All modules with the same namespace: All concepts share the same namespace; they are divided into different files and published independently from different repositories. Hash/slash indifferent.
  2. Different namespace per module: Each module has a different namespace, including all the concepts it defines. Also divided into different files, published independently, and hash/slash indifferent. Easier to manage each module, but more difficult to remember which module each term belongs to. Example with hash and with slash.
  3. Hybrid: Same namespace for all terms, but each module has a different IRI. The terms have an rdfs:isDefinedBy property that points to the module IRI. Harder to publish with hash; it would require publishing with slash. Example.
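
For illustration, the three options might look as follows (the IRIs below are hypothetical, chosen only to show the pattern):

```
Option 1 (shared namespace):  rml:source  -> http://example.org/rml/source
                              (defined in the io module, same namespace as core terms)
Option 2 (per-module):        rmlio:source -> http://example.org/rml/io#source
Option 3 (hybrid):            rml:source  -> http://example.org/rml/source
                              plus rdfs:isDefinedBy <http://example.org/rml/io>
```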

Testcases: port from RML testcases

SHACL shapes in #61 are hard to validate because there are no test cases.
Let's port them from RML testcases to the new specification and make sure they are covering the complete specification.

Data error handling (e.g., lenient mode)

There are a couple of MUST statements in the spec that depend on the actual data; these could be revisited to support some kind of 'lenient' mode of processing the mapping document.

For example: I have a mapping file with 10 triples maps that map 10 tables, which I use for mapping databases A and B (and many others). 3 of those tables are optional in database B, which crashes the mapping engine because a logical source description MUST resolve to a data source (see e.g. the R2RML spec).

  • lenient mode: process only the 7 resolvable triples maps
  • similar for "The referenced columns of all term maps of a triples map (subject map, predicate maps, object maps, graph maps) MUST be column names that exist in the term map's logical table.": don't crash when some references are missing, but generate all the others

It's an open question whether this is functionality of the mapping engine or a feature of the language; however, I think it would be good to clarify which errors should cause the engine to exit and which should be handled gracefully.
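
A minimal sketch of what such a 'lenient' processing loop could look like (the resolver API and all names are hypothetical, not from the spec):

```python
def process_lenient(triples_maps, resolve_source):
    """Process only the triples maps whose logical source resolves.

    resolve_source(tm) returns the data source, or None if it cannot
    be resolved (e.g. an optional table that is absent in this database).
    """
    processed, skipped = [], []
    for tm in triples_maps:
        source = resolve_source(tm)
        if source is None:
            skipped.append(tm)  # lenient: warn and continue instead of crashing
        else:
            processed.append((tm, source))
    return processed, skipped

# Example: 10 triples maps, 3 of which map tables absent in database B.
available = {f"tm{i}" for i in range(7)}
maps = [f"tm{i}" for i in range(10)]
done, skipped = process_lenient(maps, lambda tm: tm if tm in available else None)
```

In strict mode the same loop would raise on the first unresolvable source; the open question is whether the spec or the engine decides which behavior applies.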

Logical source cardinality

I see that we currently define the logical source cardinality to be exactly 1,

i.e. each TriplesMap should have exactly one Logical Source.
This holds in v1 of the shapes and in the current version of the spec.

Question is: do we keep this? Should it always be 1? Should it be at least 1?

inverse predicates in RML

An aspect missing from RML is the ability to generate a triple using an inverse predicate.

Specifying the generation of triples with an inverse predicate would allow you to generate a triple in the inverse direction, i.e. object -> predicate -> subject, in the context of the subject-generating triples map.

Note: YARRRML also introduces inverse predicates.

Proposal:

  • Add a new construct rml:InversePredicateMap which is a term map that can be referenced from a rml:PredicateObjectMap using rml:inversePredicateMap.
  • Add a shortcut property rml:inversePredicate for rml:inversePredicateMap, analogous to rr:predicate.
  • Redefine rml:PredicateObjectMap such that it should have at least one rml:PredicateMap or rml:InversePredicateMap. This will allow for maximum flexibility in use.
  • Restrict the allowed term types of an rml:ObjectMap that is referenced by a rml:PredicateObjectMap which also references an rml:InversePredicateMap to IRIs and blank nodes.

Example:

<someTriplesMap>
  rml:logicalSource [] ;
  rr:subject ex:subject ;
  rr:predicateObjectMap [
    rr:predicateMap [
      rr:constant ex:parent ;
    ] ;
    rml:inversePredicateMap [
      rr:constant ex:child ;
    ] ;
    rr:object ex:object ;
  ] .

or using shortcut properties:

<someTriplesMap>
  rml:logicalSource [] ;
  rr:subject ex:subject ;
  rr:predicateObjectMap [
    rr:predicate ex:parent ;
    rml:inversePredicate ex:child;
    rr:object ex:object ;
  ] .

would generate:

ex:subject ex:parent ex:object .

ex:object ex:child ex:subject .

base IRI description in the spec

Is the description/definition of the base IRI sufficient as it comes from the R2RML spec, or do we need to further clarify certain aspects of it?

Defining window operations in RML

Issue

Currently, there is no way to define windowing semantics in RML.
Windowing is crucial when evaluating joins between different live streaming
data sources.

Furthermore, windowing could also support buffering capabilities for
aggregation functions when processing streaming data sources. For example,
calculating an average of the values over the last 5 minutes.

Requirements

According to Gedik B.,
a window's behaviour is defined by its type and its policies.

There are 2 main types of windows: tumbling and sliding windows.
An illustration of how these windows work can be found here.
Note: a session window is a special case of a tumbling window where the window
only gets dropped when the inactivity threshold is violated.

The policies control when the window evicts the tuples inside
it (eviction policy), and when it triggers the processing of the
tuples using the operator logic defined inside the window (trigger policy).

Policies are further divided into 4 categories, namely:

  1. Count-based
    • Uses the number of incoming tuples to inform when to evict/trigger.
  2. Delta-based
    • Uses a threshold of an attribute of the incoming tuples to
      inform when to evict/trigger. E.g. When the temperature value of a sensor is above 40C.
  3. Time-based
    • Uses the timestamp of the incoming tuple.
  4. Punctuation-based
    • Injects punctuations inside the incoming data stream as markers to decide
      when to evict/trigger.

Thus, we need a vocabulary to define and configure windows by
describing:

  1. Window Type
  2. Eviction policy
  3. Trigger policy

The exact semantics and combinations of the policies are further explained by
Gedik B.
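
As a rough illustration of count-based trigger and eviction policies (a sketch; the class and parameter names are made up and not tied to any RML vocabulary):

```python
from collections import deque

class CountWindow:
    """Window with count-based trigger and eviction policies.

    Fires the operator every `trigger` tuples; evicts the oldest
    tuples so that at most `evict` tuples are buffered.
    """
    def __init__(self, trigger, evict, on_trigger):
        self.trigger, self.evict, self.on_trigger = trigger, evict, on_trigger
        self.buffer = deque()
        self.seen = 0

    def push(self, item):
        self.buffer.append(item)
        self.seen += 1
        while len(self.buffer) > self.evict:   # eviction policy: cap buffer size
            self.buffer.popleft()
        if self.seen % self.trigger == 0:      # trigger policy: every Nth tuple
            self.on_trigger(list(self.buffer))

# Mirror the example below: trigger every 5th record, evict beyond 15.
fired = []
w = CountWindow(trigger=5, evict=15, on_trigger=fired.append)
for i in range(20):
    w.push(i)
```

After 20 records, the operator has fired four times, and the buffer never grows past the eviction bound of 15.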

Example

Given the following RML with a join condition:

<#TM1> 
    rml:logicalSource <#STREAM1> ;
    rml:subjectMap <#SM1> ;
    rml:predicateObjectMap [
        rml:predicateMap <#PM1> ;
        rml:objectMap [ 
            rml:parentTriplesMap <#TM2>;
            rr:joinCondition [
                rr:child "id";
                rr:parent "p_id"; 
            ];

        ];

    ]. 



<#TM2> 
    rml:logicalSource <#STREAM2> ;
    rml:subjectMap <#SM2> ;
    rml:predicateObjectMap [
        rml:predicateMap <#PM2> ;
        rml:objectMap <#OM2> ] .

Windows could be defined in the object map:

<#TM1> 
    rml:logicalSource <#STREAM1> ;
    rml:subjectMap <#SM1> ;
    rml:predicateObjectMap [
        rml:predicateMap <#PM1> ;
        rml:objectMap [
            # Define the window to be used for joining
            rml:window [ 
                # Define window types 
                rml:windowType rml:Tumbling; 

                # Define the trigger policy for the window 
                # Every 5th record will execute the join
                rml:trigger [ a rml:CountPolicy;
                    rml:countValue  5;

                ]; 

                # Define the eviction policy for the window
                # Clean up window after processing the 15th record
                rml:evict [ a rml:CountPolicy;
                    rml:countValue  15;
                ];

            ];
            rml:parentTriplesMap <#TM2>;
            rr:joinCondition [
                rr:child "id";
                rr:parent "p_id"; 
            ];
        ];
    ]. 

<#TM2> 
    rml:logicalSource <#STREAM2> ;
    rml:subjectMap <#SM2> ;
    rml:predicateObjectMap [
        rml:predicateMap <#PM2> ;
        rml:objectMap <#OM2> ] .

(Data) Errors

The Data Errors subsection of the [R2]RML core specification might need some updating, perhaps also considering revised test cases.

parent and child terminology

The R2RML specification names rr:parent and rr:child as the parent query and child query; would we keep this terminology?

Or should we drop the query part and keep them only as parent and child? I have the impression that when we talk, we typically use these terms.

Or something else? If so, what?

Transformation function over joined sources?

Given two sources (A and B) that can be joined on a field, would it be possible to apply functions to some fields of B and use the results in any part of the TriplesMap of A?

Let me put an example.
Table A:

AC1 AC2
a a1
b b1

Table B

BC1 BC2
a1 "hello a"
b1 "hello b"

Output (applying uppercase to BC2)

<http://example.org/a> ex:predicate "HELLO A"^^xsd:string
<http://example.org/b> ex:predicate "HELLO B"^^xsd:string

I don't know whether this is currently possible by declaring the join condition in the mapping rules, or whether we would have to create an ad hoc function implementation.

Issues in test cases

  • 2c --> the IDs do not exist in the source and the mapping should not generate an output
  • 4b --> the subjectMap has term type literal and the mapping should not generate an output
  • 15b --> "spanish" is considered well-formed (not valid) as it falls under the 5*8ALPHA rule (src: https://www.rfc-editor.org/rfc/rfc5646). I would not vouch for using a list of valid tags, but we may restrict the spec by explicitly referring to the first language rule: language = 2*3ALPHA ["-" extlang]
  • 19a --> assumes the base IRI is the baseIRI of the mapping, but we've said that the baseIRI is provided as an argument (and there is a proposal to assign baseIRIs to triples maps (@dachafra)).
  • 19b --> should yield an error. The nq file contains no triple for "Juan Daniel," but that triple cannot be generated. IRI-safe values are only generated for templates, not references. At least, that is the case for R2RML (it is not specified in core). "R2RML always performs percent-encoding when IRIs are generated from string templates. If IRIs need to be generated without percent-encoding, then rr:column should be used instead of rr:template, with an R2RML view that performs the string concatenation." --> implies no percent encoding when using column/reference.
  • 20b should yield an error, as http://example.com/base/path/../Danny is not an absolute IRI.
  • There are no (simple) datatype map tests?
  • 7h should not have .nq files
  • 2g should not have .nq files (missed this, as there is no 2g for CSV files)
  • 10c-JSON -> should be "\\{\\{\\{ {$.['ISO 3166']} \\}\\}\\}"
  • 21a-JSON -> contains an additional POM that is not reflected in the output. 21a CSV and MySQL do not have that POM
  • 2c -> should not have output files
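
The language-tag point for 15b can be made concrete: a regex covering only the primary rule `language = 2*3ALPHA ["-" extlang]` (with `extlang = 3ALPHA *2("-" 3ALPHA)`) accepts the usual short tags but rejects "spanish", which is merely well-formed under 5*8ALPHA. This is a sketch, not the full RFC 5646 `Language-Tag` grammar:

```python
import re

# First language production only: 2*3ALPHA ["-" extlang].
# Deliberately NOT the full RFC 5646 Language-Tag grammar
# (no script, region, variant, extension, or privateuse subtags).
PRIMARY = re.compile(r"[A-Za-z]{2,3}(-[A-Za-z]{3}(-[A-Za-z]{3}){0,2})?")

def is_primary_langtag(tag):
    return PRIMARY.fullmatch(tag) is not None
```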

identifying blank nodes without an `rr:template` or `rml:reference`

In Section 4.1 Identifying collections, it is stated:

If no rr:template or rml:reference is provided for generating blank node IDs (rr:BlankNode) or IRIs (rr:IRI), then each iteration generates a new blank node identifier for the collection or container.

I think there is a general issue that we need to decide on here, which is: do we allow the generation of blank nodes without an explicit expression (rr:template, rml:reference, or other)?

Generating a random blank node ID has an impact on how you can implement joins on those terms, because there is no way to make the ID generation repeatable. The R2RML spec does not allow blank node generation without an rr:template or rr:column, I assume for this reason.

So if we wanted to allow this, we would have to either:

  • Force processors into a way of implementing joins.
  • Have processors issue warnings when such a mapping is encountered, since correct results cannot be guaranteed.
  • ?
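
One conceivable way to keep blank node generation repeatable without an explicit template, sketched here purely as an illustration (the function and its inputs are hypothetical, not a spec proposal): derive the identifier from the term map and the iteration's reference values.

```python
import hashlib

def blank_node_id(term_map_id, values):
    """Derive a repeatable blank node label from a term map identifier
    and the reference values of the current iteration (a sketch only)."""
    # \x1f as a field separator avoids collisions between value lists.
    key = term_map_id + "\x1f" + "\x1f".join(values)
    return "_:b" + hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
```

Because the label is a pure function of its inputs, two sides of a join that see the same values produce the same blank node, which is exactly what a random ID cannot guarantee.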

Validity of template

I believe that checking the validity of templates should be included in the shapes. I'm not sure whether SPARQL's regular expressions allow for recursion, but it can be achieved by:

  • removing the unescaped curly braces from the template
  • checking whether the resulting string matches ^[^\{\}]*(?:\{[^\{\}]+\}[^\{\}]*)*$ (balanced and not nested)

This can be implemented as a SPARQL-based constraint component.
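
For example, the two steps could look like this in Python (the regex is the one given above; the escape-removal step assumes R2RML's `\{` and `\}` escape sequences):

```python
import re

# Balanced, non-nested pairs of curly braces, as in the pattern above.
BALANCED = re.compile(r"[^{}]*(?:\{[^{}]+\}[^{}]*)*")

def template_is_valid(template):
    # Step 1: drop escaped curly braces so only placeholder braces remain.
    stripped = template.replace(r"\{", "").replace(r"\}", "")
    # Step 2: the remaining braces must be balanced and not nested.
    return BALANCED.fullmatch(stripped) is not None
```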

FnO output specification?

Just to summarize the current discussions and make sure I understand: the problem is that if a function is defined to return multiple outputs, FNML is ambiguous about which output to use.
Right now, implicitly, we assume to always take the first output. At the very least, this should be explicitly stated in the specification.
HOWEVER, that still gives problems if at some point you want to use the second output of a function :).
So we need a way to specify the output of the function, and Sam suggested using the SubjectMap for this.
(If above doesn't reflect the correct reasoning, please correct me and ignore my suggestions below ;) )

I personally think that there are the following options for specifying the output of the function:

  • explicitly state that the subjectMap within a FunctionTriplesMap has quite a different definition than subjectMaps within regular TriplesMaps, namely something like "The subjectMap of a FunctionTriplesMap generates the reference to the needed output of the function". I'm not for this option, since it's basically a redefinition and also hinders the generation of provenance data in the long run
  • provide a separate OutputMap within a FunctionTriplesMap with the definition "The outputMap of a FunctionTriplesMap generates the reference to the needed output of the function". The subjectMap's definition is untouched and can be used for provenance generation, and you can specify the output of the function
  • specify this outputMap on the level of the FunctionTermMap instead of the FunctionTriplesMap (so you can, e.g., use multiple outputs of the same FunctionTriplesMap for different (regular) TermMaps)

I'm in favor of this last option; see the example below for what this entails.

# Function description #

ex:parseName a fno:Function ;
  fno:expects ( [ fno:predicate ex:inputString ] ) ;
  fno:returns ( [ fno:predicate ex:firstName ] [ fno:predicate ex:lastName ] ) .

# Mapping #

<#Person_Mapping>
    rml:logicalSource <#LogicalSource> ;                  # Specify the data source
    rr:subjectMap <#SubjectMap> ;                         # Specify the subject
    rr:predicateObjectMap <#FirstNameMapping> ;               # Specify the predicate-object-maps
    rr:predicateObjectMap <#LastNameMapping> .

<#FirstNameMapping>
    rr:predicate foaf:firstName ;                              # Specify the predicate
    rr:objectMap <#FunctionTermMapFirstName> .                         # Specify the object-map
    
<#LastNameMapping>
    rr:predicate foaf:lastName ;                              # Specify the predicate
    rr:objectMap <#FunctionTermMapLastName> .                         # Specify the object-map

<#FunctionTermMapFirstName>
    fnml:functionValue <#parseNameFunctionTriplesMap> ;
    fnml:outputValue ex:firstName .

<#FunctionTermMapLastName>
    fnml:functionValue <#parseNameFunctionTriplesMap> ;
    fnml:outputValue ex:lastName .

<#parseNameFunctionTriplesMap>
    a fnml:FunctionTriplesMap ;
    rr:predicateObjectMap [
        rr:predicate fno:executes ;                   # Execute the function
        rr:objectMap [ rr:constant ex:parseName ] # ex:parseName
    ] ;
    rr:predicateObjectMap [
        rr:predicate ex:inputString ;
        rr:objectMap [ rr:reference "name" ]          # Use as input the "name" reference
    ] .

# When given the reference "name", e.g. value "Ben De Meester", this FunctionTriplesMap will generate the following triples:
_:a # Blank node, because no SubjectMap is given
  fno:executes ex:parseName ;
  ex:inputString "Ben De Meester" .

# After execution, the following triples will be generated:
_:a # Same blank node
  ex:firstName "Ben" ;
  ex:lastName "De Meester" .

The RML namespaces do not dereference

Observation

Some namespaces that are used in RML configuration files do not dereference. It is therefore not possible to obtain an RDF representation of these vocabularies.

IRI dereference is useful, since this allows vocabularies to be pulled into any standard-compliant environment, using a simple HTTP request.

MWE 1

The following cURL request reproduces the dereference that is performed by TriplyDB, but this should be very similar to how any other standards-conforming linked data client sends such requests:

 curl -vL -H 'Accept: application/trig, application/n-quads, application/n-triples;q=0.9, text/turtle;q=0.9, application/x-turtle;q=0.9, text/rdf+n3;q=0.9, application/rdf+xml;q=0.8, text/plain;q=0.8, */*;q=0.7' 'http://semweb.mmlab.be/ns/ql#'
*   Trying 193.191.148.200:80...
* Connected to semweb.mmlab.be (193.191.148.200) port 80
> GET /ns/ql HTTP/1.1
> Host: semweb.mmlab.be
> Accept: application/trig, application/n-quads, application/n-triples;q=0.9, text/turtle;q=0.9, application/x-turtle;q=0.9, text/rdf+n3;q=0.9, application/rdf+xml;q=0.8, text/plain;q=0.8, */*;q=0.7
>
< HTTP/1.1 404 Not Found
< Server: nginx/1.14.0 (Ubuntu)
< Date: Sun, 11 Feb 2024 08:55:31 GMT
< Content-Type: text/html; charset=utf-8
< Content-Length: 12
< Connection: keep-alive
< X-Powered-By: Express
< ETag: "703595115"

Notice that the 'ql' vocabulary does not exist.

MWE 2

The following cURL request reproduces the dereference that is performed by TriplyDB, but this should be very similar to how any other standards-conforming linked data client sends such requests:

curl -vL -H 'Accept: application/trig, application/n-quads, application/n-triples;q=0.9, text/turtle;q=0.9, application/x-turtle;q=0.9, text/rdf+n3;q=0.9, application/rdf+xml;q=0.8, text/plain;q=0.8, */*;q=0.7' 'http://www.w3.org/ns/r2rml#' > aap
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying [2606:4700::6812:1613]:80...
* Connected to www.w3.org (2606:4700::6812:1613) port 80
> GET /ns/r2rml HTTP/1.1
> Host: www.w3.org
> User-Agent: curl/8.4.0
> Accept: application/trig, application/n-quads, application/n-triples;q=0.9, text/turtle;q=0.9, application/x-turtle;q=0.9, text/rdf+n3;q=0.9, application/rdf+xml;q=0.8, text/plain;q=0.8, */*;q=0.7
>
< HTTP/1.1 301 Moved Permanently
< Date: Sun, 11 Feb 2024 09:05:08 GMT
< Transfer-Encoding: chunked
< Connection: keep-alive
< Cache-Control: max-age=3600
< Expires: Sun, 11 Feb 2024 10:05:08 GMT
< Location: https://www.w3.org/ns/r2rml
< Set-Cookie: __cf_bm=jBbWIn71PDCr7f80XLmWc0dTMUnSLHJwXOt9OWTrpKc-1707642308-1-AetC7y7UHMuoI4vdIMnsELUEU6fAEyQalSKFTSyBD4x4rsb61a8khjk+oPEBlmnXo79h7d6zSAwZdHXwomNgAW4=; path=/; expires=Sun, 11-Feb-24 09:35:08 GMT; domain=.w3.org; HttpOnly; SameSite=None
< Server: cloudflare
< CF-RAY: 853b6da918b40e40-AMS
< alt-svc: h3=":443"; ma=86400
<
* Ignoring the response-body
{ [5 bytes data]
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
* Connection #0 to host www.w3.org left intact
* Clear auth, redirects to port from 80 to 443
* Issue another request to this URL: 'https://www.w3.org/ns/r2rml'
*   Trying [2606:4700::6812:1613]:443...
* Connected to www.w3.org (2606:4700::6812:1613) port 443
* schannel: disabled automatic use of client certificate
* ALPN: curl offers http/1.1
* ALPN: server accepted http/1.1
* using HTTP/1.1
> GET /ns/r2rml HTTP/1.1
> Host: www.w3.org
> User-Agent: curl/8.4.0
> Accept: application/trig, application/n-quads, application/n-triples;q=0.9, text/turtle;q=0.9, application/x-turtle;q=0.9, text/rdf+n3;q=0.9, application/rdf+xml;q=0.8, text/plain;q=0.8, */*;q=0.7
>
* schannel: remote party requests renegotiation
* schannel: renegotiating SSL/TLS connection
* schannel: SSL/TLS connection renegotiated
* schannel: failed to decrypt data, need more data
< HTTP/1.1 200 OK
< Date: Sun, 11 Feb 2024 09:05:08 GMT
< Content-Type: text/html; charset=utf-8
< Transfer-Encoding: chunked
< Connection: keep-alive
< content-location: r2rml.html
< vary: negotiate,accept,Accept-Encoding
< tcn: choice
< last-modified: Mon, 17 Sep 2012 15:21:58 GMT
< etag: W/"1818d-4c9e7559fb180;a0-4939a0734f380
< Cache-Control: max-age=21600
< expires: Sun, 11 Feb 2024 15:05:08 GMT
< x-backend: www-mirrors
< x-request-id: 853b6da9bc5d6572
< strict-transport-security: max-age=15552000; includeSubdomains; preload
< content-security-policy: frame-ancestors 'self' https://cms.w3.org/; upgrade-insecure-requests
< CF-Cache-Status: BYPASS
< Set-Cookie: __cf_bm=7cDNNh9LZo1Y8n8aNwGfBd8DVwH3YTA9o5ef8kMnAXs-1707642308-1-ATwrFtMqqk6DhOQy6Oc0oBj4wSERNIUTH6h+x4xfuHHtbtg52f2QT6kkJ8dRknK0TXoyaik1f7/Vg6N87o1hMaE=; path=/; expires=Sun, 11-Feb-24 09:35:08 GMT; domain=.w3.org; HttpOnly; Secure; SameSite=None
< Server: cloudflare
< CF-RAY: 853b6da9bc5d6572-AMS
< alt-svc: h3=":443"; ma=86400
<
{ [11336 bytes data]
100 98701    0 98701    0     0   259k      0 --:--:-- --:--:-- --:--:--     0
* Connection #1 to host www.w3.org left intact

Notice that the 'rr'/'r2rml' vocabulary exists, but is not available in an RDF serialization format.

MWE 3

 curl -vL -H 'Accept: application/trig, application/n-quads, application/n-triples;q=0.9, text/turtle;q=0.9, application/x-turtle;q=0.9, text/rdf+n3;q=0.9, application/rdf+xml;q=0.8, text/plain;q=0.8, */*;q=0.7' 'http://www.w3.org/ns/csvw#' > aap
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying [2606:4700::6812:1713]:80...
* Connected to www.w3.org (2606:4700::6812:1713) port 80
> GET /ns/csvw HTTP/1.1
> Host: www.w3.org
> User-Agent: curl/8.4.0
> Accept: application/trig, application/n-quads, application/n-triples;q=0.9, text/turtle;q=0.9, application/x-turtle;q=0.9, text/rdf+n3;q=0.9, application/rdf+xml;q=0.8, text/plain;q=0.8, */*;q=0.7
>
< HTTP/1.1 301 Moved Permanently
< Date: Sun, 11 Feb 2024 09:32:16 GMT
< Transfer-Encoding: chunked
< Connection: keep-alive
< Cache-Control: max-age=3600
< Expires: Sun, 11 Feb 2024 10:32:16 GMT
< Location: https://www.w3.org/ns/csvw
< Set-Cookie: __cf_bm=iOe70axn1ua.4ohv_Y.cH9yRby0WSFMGAjmzgJyrYKU-1707643936-1-AcNjXt1N40OIS6F0aOVweENzSjT8ag0qSDRRNJusBhq5DHAXg0rRJGOInPYLu45zM7SjwJI50Kqq6cuhwZXN2P8=; path=/; expires=Sun, 11-Feb-24 10:02:16 GMT; domain=.w3.org; HttpOnly; SameSite=None
< Server: cloudflare
< CF-RAY: 853b956df9e466f7-AMS
< alt-svc: h3=":443"; ma=86400
<
* Ignoring the response-body
{ [5 bytes data]
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
* Connection #0 to host www.w3.org left intact
* Clear auth, redirects to port from 80 to 443
* Issue another request to this URL: 'https://www.w3.org/ns/csvw'
*   Trying [2606:4700::6812:1713]:443...
* Connected to www.w3.org (2606:4700::6812:1713) port 443
* schannel: disabled automatic use of client certificate
* ALPN: curl offers http/1.1
* ALPN: server accepted http/1.1
* using HTTP/1.1
> GET /ns/csvw HTTP/1.1
> Host: www.w3.org
> User-Agent: curl/8.4.0
> Accept: application/trig, application/n-quads, application/n-triples;q=0.9, text/turtle;q=0.9, application/x-turtle;q=0.9, text/rdf+n3;q=0.9, application/rdf+xml;q=0.8, text/plain;q=0.8, */*;q=0.7
>
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* schannel: remote party requests renegotiation
* schannel: renegotiating SSL/TLS connection
* schannel: SSL/TLS connection renegotiated
* schannel: failed to decrypt data, need more data
< HTTP/1.1 200 OK
< Date: Sun, 11 Feb 2024 09:32:17 GMT
< Content-Type: text/html; charset=utf-8
< Transfer-Encoding: chunked
< Connection: keep-alive
< content-location: csvw.html
< vary: negotiate,accept,Accept-Encoding
< tcn: choice
< last-modified: Mon, 08 Oct 2018 10:13:20 GMT
< etag: W/"18ee4-577b4ded6f000;9f-50ab8a8466840
< Cache-Control: max-age=21600
< expires: Sun, 11 Feb 2024 15:32:17 GMT
< x-backend: www-mirrors
< x-request-id: 853b956eaa726621
< strict-transport-security: max-age=15552000; includeSubdomains; preload
< content-security-policy: frame-ancestors 'self' https://cms.w3.org/; upgrade-insecure-requests
< CF-Cache-Status: BYPASS
< Set-Cookie: __cf_bm=WEw9_l4uyxFUnuconnVFa9rHaXt6lr61F5IPvGJKtxg-1707643937-1-AUsPiDkr+ZoFfbXbGRap+pX5GEIGjZcJoC6bNoGr5ornXc+l7FchvMzFwb74Iu1lN0rmoUa9v5Fl9ZY3vGIO/pE=; path=/; expires=Sun, 11-Feb-24 10:02:17 GMT; domain=.w3.org; HttpOnly; Secure; SameSite=None
< Server: cloudflare
< CF-RAY: 853b956eaa726621-AMS
< alt-svc: h3=":443"; ma=86400
<
{ [25027 bytes data]
100   99k    0   99k    0     0   210k      0 --:--:-- --:--:-- --:--:--  727k
* Connection #1 to host www.w3.org left intact
*

Notice that 'csvw' only exists in HTML (but not in any RDF format).

Expected

All vocabularies that are commonly used in RML configuration files should be available through IRI dereferencing.

logical source reference from the core part of the spec

The core description has references to some input data, but if we split the logical source and core parts of the spec, how should we handle the references of the examples in the core part?

Should we introduce an abstract running example, or should we still have examples for all exemplary data sources?

Join specification when logical source is the same

Let's say we have two triples maps that refer to the same logical source (and by same, we really mean the same URI, not "same because the descriptions lead to the semantically same logical source").

Sample source (CSV)

id,parent_id
1,2
2,1

Base mapping (YARRRML)

prefixes:
  ex: http://example.com#
sources:
  test: [data.csv]
mappings:
  test1:
    s: ex:$(id)
    po:
      p: ex:parent
      o:
        mapping: test2
  test2:
    s: ex:$(parent_id)

We have the following use cases that are underspecified in the spec.

The spec currently says: "If the logical source of the child triples map and the logical source of the parent triples map of a referencing object map are not identical, then the referencing object map must have at least one join condition."

  1. If a join condition is specified AND the logical source is not the same: common case, execute the join condition between each iteration pair.
  2. If a join condition is specified AND the logical source is the same: same as above.
  3. If no join condition is specified AND the logical source is not the same: do a full join (i.e., take all iterations into account).
    • example output: ex:1 ex:parent ex:2, ex:1 ex:parent ex:1, ex:2 ex:parent ex:2, ex:2 ex:parent ex:1
  4. If no join condition is specified AND the logical source is the same: don't do a full join, but take only the current iteration into account.
    • example output: ex:1 ex:parent ex:2, ex:2 ex:parent ex:1
    • this last one is the edge case, but it allows a 'join per iteration'. Question is: should we make this edge case explicit, or should there be a different way to tackle it?
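
Using the sample CSV above, the difference between the two no-join-condition cases can be sketched as follows (`rows` stands in for the iterations of the logical source; this is an illustration, not engine code):

```python
# Iterations of data.csv from the example above.
rows = [{"id": "1", "parent_id": "2"},
        {"id": "2", "parent_id": "1"}]

# No join condition, different logical sources: full cross join
# over all child/parent iteration pairs.
full_join = [(c["id"], p["parent_id"]) for c in rows for p in rows]

# No join condition, same logical source: 'join per iteration',
# pairing each iteration only with itself.
per_iteration = [(r["id"], r["parent_id"]) for r in rows]
```

The pairs correspond to the example outputs: the full join yields all four `ex:parent` triples, while the per-iteration join yields only the two.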

FnO - Does a Function Triples Map need a Logical Source?

The current use case for a LogicalSource definition on a FunctionTriplesMap seems to be:

The ability to generate values from a different source and use these values as the result of a Function Term Map.

An example of this is included in one of the proposed FnO test cases: RMLFNOTC009

However, since a FunctionTriplesMap doesn't generate values directly, but generates intermediate function execution triples expressed in FnO, the question of how to handle joins between a TriplesMap and a FunctionTriplesMap with a different LogicalSource arises.

As this is not the same type of join as a join on a RefObjectMap, this join would have to be defined. Subsequently, engines would have to implement another specific type of join.

At the same time, we have a very similar mapping challenge for generating literal values by joining different logical sources: the join-on-literal challenge.

I believe it would be advantageous to come up with a solution that covers both generating literals from different LogicalSources using joins and generating function values from different LogicalSources.
As this solution would not be specific to functions, I think we should look for a solution in the definition of LogicalSources. (pinging @thomas-delva)

Add section for implementation considerations

There are several considerations that can be made when implementing RML+FnO. Some of these might need to be specified in their own section; others might be too specific to certain implementations, but might still be interesting to mention. For now we can add a section 'Implementation considerations' to collect these, as discussed in this slack thread.

Examples of issues:

  • Handling programming language specific datatypes
  • Special handling of RDF term types in the used software library (e.g. RDF/JS, Jena, etc.)
    • e.g. what if a function returns an RDF literal with language tag?
  • What should an engine do with the generated FnO execution triples (from the function triples map)?

Mistakes in the shapes

  • rml:cartessianProduct should be rml:cartesianProduct (2 occurrences)
  • The message for RMLLanguageMapShape seems incorrect: "rml:LanguageMap must specify an rml:template or rml:constant with the IRI of the language." --> Languages do not have an IRI.

CI/CD workflows for spec repositories

Proposal sketched up with @DylanVanAssche for CI/CD workflows for the spec repositories, to keep specs and ontology/shapes in line, and also do some automatic validations / generation where applicable.

shapes coverage test

Create a GitHub action, to be used by all spec repos, that contains generic CI/CD functionality.
As input to the action, 0 or more other spec repositories can be specified on which this spec repo depends.

The action then collects the necessary artefacts from the repos and executes the following steps:

  1. Syntax-validate/parse the ontologies (if an error is encountered, give a clear message with the offending repo and file)
  2. Validate / Test shapes
  3. Validate examples

All spec repos get the following directory structure:

[spec-repo]
├── model/
│   ├── ontology/       # Ontology reflecting the spec ( this will be combined with rml-core and possibly other specs )
│   └── shapes/         # Shapes reflecting the spec
│       └── tests/      # Test cases which cover the shapes
└── examples/           # Examples used in the spec

Under model/shapes/tests we have test cases that cover the shapes reflecting the spec developed in the current repo.
This helps us ensure that our shapes and ontology are valid and stay up to date with the spec.

Under examples/ we have all examples that are used in the text of the spec. We place them in this standard location
so that they can also be validated against the shapes used in the previous step.

On each push to any branch, these tests will be run.

PR templates

We will introduce PR templates including a checklist which reminds us to:

  • make sure that the model is in sync with the spec.

Generate combined model

We can introduce a merge action which triggers a commit or PR to update the full model (in rml-core repo?)

Templates, safe separators, and null-values

TL;DR

R2RML defines a safe separator that should be used when a template contains more than one value/column reference. This issue discusses the use case for this, how null values in value references in templates should be handled, and tries to determine how current (R2)RML processors handle this.

Description

In the definition of templates in R2RML the following is stated:

If a template contains multiple pairs of unescaped curly braces, then any pair SHOULD be separated from the next one by a safe separator. This is any character or string that does not occur anywhere in any of the data values of either referenced column; or in the IRI-safe versions of the data values, if the term type is rr:IRI (see note below).

A few things of note here:

  • First off, the usage of the keyword SHOULD instead of MUST. So engines can deviate from this for valid reasons in particular circumstances? What are those reasons?

  • So a template like "http://example.com/{}{}" SHOULD actually not be used/allowed.

  • How does one know upfront what "character or string that does not occur anywhere in any of the data values of either referenced column; or in the IRI-safe versions of the data values"? Or should this be validated upon evaluating the template?

  • This doesn't seem to be covered in the R2RML test cases, other than that all templates with multiple pairs of unescaped curly braces do seem to have a separator.

  • What happens when one of the references in a template with multiple pairs of unescaped curly braces is NULL? The specified algorithm is actually not clear on this:

    1. Let result be the template string
    2. For each pair of unescaped curly braces in result:
        1. Let value be the data value of the column whose name is enclosed in the curly braces
        2. If value is NULL, then return NULL
        3. Let value be the natural RDF lexical form corresponding to value
        4. If the term type is rr:IRI, then replace the pair of curly braces with an IRI-safe version of value; otherwise, replace the pair of curly braces with value
    3. Return result
    

    Does returning null at 2.2 mean returning null for evaluation of the entire template? Or just for that pair of curly braces? And if the latter, why mention this at all in the algorithm?

    However if the former, why would there be a need for a safe separator?
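Under the "former" interpretation (a NULL reference makes the whole template evaluate to NULL), the quoted algorithm can be sketched as follows. This is an illustrative sketch, not normative: the natural RDF lexical form is simplified to `str()`, and Python's `urllib.parse.quote` stands in for the IRI-safe percent-encoding.

```python
import re
from urllib.parse import quote

class _Null(Exception):
    """Raised when a referenced value is NULL (step 2.2)."""

def expand_template(template, row, term_type="IRI"):
    """Expand an R2RML/RML template against a row (dict of reference -> value)."""
    def substitute(match):
        value = row.get(match.group(1))
        if value is None:
            raise _Null()                     # step 2.2: whole template -> NULL
        value = str(value)                    # step 2.3, simplified
        # step 2.4: IRI-safe version of the value for term type IRI
        return quote(value, safe="") if term_type == "IRI" else value
    try:
        return re.sub(r"\{([^{}]+)\}", substitute, template)  # step 2
    except _Null:
        return None                           # NULL for the entire template

# A NULL reference yields NULL for the whole term:
assert expand_template("http://example.com/{A}{B}", {"A": "A~", "B": None}) is None
# Otherwise each pair of braces is replaced by the (IRI-safe) value:
assert expand_template("http://example.com/{A}-{B}", {"A": "A", "B": "~"}) \
    == "http://example.com/A-~"
```

Under this reading, a NULL reference can never contribute to a generated term, which is exactly why the need for a safe separator becomes questionable.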

The only use case I can imagine for a safe separator is to not get clashes when a referenced value in a template with multiple referenced values is empty.

Given:

A  | B
---|-----
A  | ~
A~ | NULL

with a template without a safe separator, "http://example.com/{A}{B}", and not returning null for the template when one of its referenced values is null, this would result in

  • http://example.com/A~
  • http://example.com/A~

— an (undesired?) clash.

With a template with a safe separator, "http://example.com/{A}-{B}", and not returning null for the template when one of its referenced values is null, this would result in

  • http://example.com/A-~
  • http://example.com/A~-

— no clash.

Of course with or without a safe separator, if a template should be evaluated to null if one of its referenced values is null, in both cases the result for "http://example.com/{A}{B}" would be:

  • http://example.com/A~

However, safe separators don't actually solve clashes when the same value reference can contain both empty strings and NULLs. Given:

A  | B
---|--------
A  | ~
A~ | NULL
A~ | (empty)

In the first case you'd get all clashes:

  • http://example.com/A~
  • http://example.com/A~
  • http://example.com/A~

In the second case you'd get one clash:

  • http://example.com/A-~
  • http://example.com/A~-
  • http://example.com/A~-

In the third case (although it probably depends on the processor) you'd get one clash:

  • http://example.com/A~
  • http://example.com/A~
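The clash analysis above can be checked mechanically. A minimal sketch, assuming the lenient interpretation in which a NULL reference is treated as an empty string rather than making the whole template NULL:

```python
import re

def expand_lenient(template, row):
    """Expand a template, substituting '' for NULL references (illustrative only)."""
    return re.sub(r"\{([^{}]+)\}", lambda m: row[m.group(1)] or "", template)

rows = [
    {"A": "A",  "B": "~"},   # row 1
    {"A": "A~", "B": None},  # row 2: B is NULL
    {"A": "A~", "B": ""},    # row 3: B is the empty string
]

# Without a safe separator, all three rows clash:
no_sep = {expand_lenient("http://example.com/{A}{B}", r) for r in rows}
assert no_sep == {"http://example.com/A~"}

# With the safe separator "-", rows 2 and 3 still clash with each other:
with_sep = [expand_lenient("http://example.com/{A}-{B}", r) for r in rows]
assert with_sep[0] == "http://example.com/A-~"
assert with_sep[1] == with_sep[2] == "http://example.com/A~-"
```

This confirms the point made above: a safe separator distinguishes NULL/empty values across different references, but cannot distinguish a NULL from an empty string within the same reference.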

Questions

  1. Is there a use case to generate a value for a template when one of its referenced values is null?

    a. If yes, what happens if all referenced values are null?

  2. Do we want to maintain safe separators?

  3. How do current (R2)RML processors handle this? (Since this is not covered in the R2RML test cases)

  4. Is there another use case for safe separators that I'm missing?

Overview template implementation in current processors

Processor    | Requires safe separator in templates | Returns value for templates with one or more null references
-------------|--------------------------------------|--------------------------------------------------------------
CARML        | NO                                   | NO
RMLMapper    | NO                                   | NO
Morph-KGC    | NO                                   | NO
SDM-RDFizer  | NO                                   | NO
Ontop        | NO                                   | NO
R2RML-F      | NO                                   | NO

Specify rr:parent as a shortcut reference property of rr:parentMap (and similar for rr:child as shortcut reference property for rr:childMap)

The RMLMapper supports rml:parentTermMap: https://github.com/RMLio/rmlmapper-java/search?q=parentTermMap

:om_5 a rr:ObjectMap;
    rr:parent "friendID".

EQUALS

:om_5 a rr:ObjectMap;
    rml:parentTermMap :ptm_0.
:ptm_0 rml:reference "friendID".

This allows us to support joining on, e.g., constants (which is interesting when you want to join on IDs and, e.g., filter on constant values in the join condition), templates, or function values.
It's a subtle extension, but it greatly increases the complexity of the join operator.
Even if joining in RML gets a revamp, this is mostly a suggestion to allow term maps instead of only references.

function map definition

@andimou 's comment (putting it here so the discussion doesn't get lost):

I think when we defined this back in the days we had in mind a superclass of Term Map and Function Map, as it might be a bit incorrect to say that the Function Map is a type of Term Map: a Term Map expects at least one of reference/template/constant, or none of them but a term type specifying it is a blank node, whereas the Function Map, as you also mention, expects a triples map.

In principle it's something like, an RDF term is generated by

  • a Term Map, which may be
    • reference-based
    • constant-based
    • template-based
    • none of the above but with term type blank node
  • a Function Map

    i.e. a Function Map is an alternative to what nowadays is a Term Map.

(or do I not remember this correctly?)

subject is IRI?

The R2RML spec says

The subjects often are IRIs that are generated from the primary key column(s) of the table.

does this "often" refer to the IRIs or to the primary key column(s)?

I would think that this would translate to

The subjects are IRIs that are often generated from the primary key column(s) of the table.

or could they be anything else as well?

Ability to generate URI terms next to IRI terms

In a recent discussion in kg-construct/rml-questions#28 it was suggested that it might be useful to support generation of URIs next to IRIs, to facilitate Linked Data dereferencing. As HTTP only supports URIs, implementing IRI dereferencing inevitably requires IRI-to-URI mapping below the surface (for example, how DBpedia did it). It might therefore be valuable to also be able to generate URIs.

Note by @DylanVanAssche on this:

Maybe we need to have both rr:IRI and rr:URI as rr:termType values in the new spec?
This way, mapping rules can be explicit about this.

Is this indeed a useful enough feature to add to RML?

no section of default mapping generator in RML-core

The R2RML spec includes the following paragraph for duplicates handling:

Duplicate row preservation: For tables without a primary key, the Direct Graph requires that a fresh blank node is created for each row. This ensures that duplicate rows in such tables are preserved. This requirement is relaxed for R2RML default mappings: They MAY reuse the same blank node for multiple duplicate rows. This behaviour does not preserve duplicate rows. R2RML default mapping generators that provide default mappings based on the Direct Graph MUST document whether the generated default mapping preserves duplicate rows or not.

This needs to be adjusted in the case of heterogeneous data.

Extending usage of GraphMap at PredicateMap and ObjectMap level

Issue

Currently in the proposed spec, rml:graphMap is only usable under PredicateObjectMap and SubjectMap.

The current spec does not allow fine-tuning where you want to put the named graphs at the predicate/object level without considering things at the PredicateObjectMap level.
In my opinion, enabling usage of rml:graphMap at all term map levels (SubjectMap, PredicateMap, and ObjectMap) will give the user more control to decide, at the term level, which graph the output will belong to.
It will also reduce redundancy when writing RML documents by keeping closely related POMs in one place instead of spread out all over the document.

Example

Suppose you want all triples with predicate <predicate AB> to be in the named graphs <graph A> and <graph B>.
But, you want only <predicate C> in named graph <graph C> with the same object <object>.

<subject> <predicateAB> <object> <graphA>.
<subject> <predicateAB> <object> <graphB>.
<subject> <predicateC> <object> <graphC>.

Current RML mapping to achieve this:

  rml:subjectMap [
    rml:template "subject"
  ];

  rml:predicateObjectMap [
    rml:predicateMap [
      rml:constant ex:predicateAB
    ];
    rml:graph  ex:graphA;
    rml:graph  ex:graphB;
    rml:object "object"
  ];

  rml:predicateObjectMap [
    rml:predicateMap [
      rml:constant ex:predicateC
    ];
    rml:graph  ex:graphC;
    rml:object "object"
  ].

If GraphMaps could be defined at PredicateMap level, the following would be possible:

  rml:subjectMap [
    rml:template "subject"
  ];

  rml:predicateObjectMap [
    rml:predicateMap [
      rml:constant ex:predicateAB;
      rml:graph    ex:graphA;
      rml:graph    ex:graphB
    ];
    rml:predicateMap [
      rml:constant ex:predicateC;
      rml:graph    ex:graphC
    ];
    rml:object "object"
  ].

This also better aligns with FnO where FunctionMaps could be applied at every TermMap level (SubjectMap, PredicateMap, and ObjectMap)

Desired solution

[image: diagram of the proposed model]

rr:template: also provide an URI-unsafe alternative?

Right now, term maps with rml:template and term type rml:IRI always perform percent-encoding, and the spec notes that non-encoded values can only be achieved using rml:reference, which thus requires preprocessing. Alternatively, an FnO string concatenation function could be used.

However, this turns out to be a very common case, e.g. creating a mailto: resource from an email address: you just want to be able to do rml:template "mailto:{mailadres}", but that doesn't work because the @ will be percent-encoded (see RMLio/rmlmapper-java#219 (comment) for a recent similar request).

Does it make sense to introduce an rml:templateUnsafe predicate or similar, as an easy-to-use construct for handling this common request? It would do the same as rml:template, but never apply the IRI-safe transformation to the referenced values.
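To make the difference concrete, here is a sketch contrasting the current rml:template behavior with the proposed unsafe variant. The helper and its iri_safe flag are made up for illustration; {mailadres} is the column reference from the example above.

```python
import re
from urllib.parse import quote

def expand(template, row, iri_safe=True):
    """Expand a template; iri_safe=True mimics rml:template's percent-encoding."""
    def sub(m):
        value = str(row[m.group(1)])
        return quote(value, safe="") if iri_safe else value
    return re.sub(r"\{([^{}]+)\}", sub, template)

row = {"mailadres": "alice@example.org"}

# rml:template percent-encodes the '@', producing an unusable mailto: IRI:
assert expand("mailto:{mailadres}", row) == "mailto:alice%40example.org"

# A hypothetical rml:templateUnsafe would leave the referenced value as-is:
assert expand("mailto:{mailadres}", row, iri_safe=False) == "mailto:alice@example.org"
```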

predicate could be a blank node?

The RML spec states the following:

Both predicate maps and object maps are term maps.

But a term map can be a blank node.

Would this be "misleading"?
Does this mean that a predicate could be a blank node?
Obviously, this is not correct: a predicate should only be an IRI.


Separate sections for ExpressionMap and TermMaps in spec

Currently in the spec, the Expression Map is introduced as a subsection of the Term Maps section. This is strange, since it is broader than a term map and is also used as a superclass by non-term-map constructs.

Furthermore, the description of Term Maps and Expression Map is combined in sections like "Reference-valued Expression Maps and term types".
The spec also currently states things like:

A reference-valued Expression Map generates an RDF Term which is by default a Literal.

This has deviated significantly from how it was described at first, and is not as originally intended. The idea was to introduce expression maps not as generating RDF terms, but simply as generating values over an input source using an expression.
This to allow it to be used for more things than only generating terms. For example join conditions.

I think the current mixing of Term Maps and expression map makes the spec difficult to follow. Also, it makes it harder for the other documents that need to reference sections about Expression Map aspects without Term Map aspects to do so.

I would prefer to see the Expression Map first introduced in a separate section, as abstract functions on the input source, followed by a section about term maps which then concretely describes how RDF terms are generated from those values.

Some bad hyperlinks in the spec

For instance, in Section 2.2 Mapping Graphs and the RML Vocabulary, there are a bunch of links that point to the spec as a whole rather than to a specific section. The targeted anchors (e.g., Logical Source) cannot be found in the spec either.

Joins in RML

How joins are currently handled:

We refer to a Parent Triples Map but in fact we use the Subject Map of the Triples Map.

<#TM1> 
    rml:logicalSource <LS1> ;
    rml:subjectMap <#SM1> ;
    rml:predicateObjectMap [
        rml:predicateMap <#PM1> ;
        rml:objectMap [ rml:parentTriplesMap <#TM2> ] ]. 

<#TM2> 
    rml:logicalSource <LS2> ;
    rml:subjectMap <#SM2> ;
    rml:predicateObjectMap [
        rml:predicateMap <#PM2> ;
        rml:objectMap <#OM2> ] .

Limitations:

This limits us because a Subject Map can only be an IRI or a Blank Node but it cannot be a literal.

It also limits us as we can only join the object of the one triple with the subject of another triple.

Solution:
Actually refer to the RDF term which we want to reuse instead of the Triples Map.

In this case, the above would become:

<#TM1> 
    rml:logicalSource <LS1> ;
    rml:subjectMap <#SM1> ;
    rml:predicateObjectMap [
        rml:predicateMap <#PM1> ;
        rml:objectMap [ rml:parentTermMap <#SM2> ] ]. 

<#TM2> 
    rml:logicalSource <LS2> ;
    rml:subjectMap <#SM2> ;
    rml:predicateObjectMap [
        rml:predicateMap <#PM2> ;
        rml:objectMap <#OM2> ] .

(rml:parentTermMap would be introduced to refer to the Term we want to reuse)

We can reuse the above to even refer to the object of <#TM2>, e.g.:

<#TM1> 
    rml:logicalSource <LS1> ;
    rml:subjectMap <#SM1> ;
    rml:predicateObjectMap [
        rml:predicateMap <#PM1> ;
        rml:objectMap [ rml:parentTermMap <#OM2> ] ]. 

<#TM2> 
    rml:logicalSource <LS2> ;
    rml:subjectMap <#SM2> ;
    rml:predicateObjectMap [
        rml:predicateMap <#PM2> ;
        rml:objectMap <#OM2> ] .

and we can do that in any Term Map, e.g., in the following we join the Subject Map of the one TM with the Subject Map of the other TM:

<#TM1> 
    rml:logicalSource <LS1> ;
    rml:subjectMap [ rml:parentTermMap <#SM2> ] .

<#TM2> 
    rml:logicalSource <LS2> ;
    rml:subjectMap <#SM2> .

That proposal would solve the issues with joins for literals but not the cases where we want to use data from two data sources as in kg-construct/rml-jc#3.
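As a rough illustration of the proposed semantics (not part of the spec): the object of TM1 is produced by evaluating the referenced term map (<#SM2>) over the rows of the parent logical source, joined with the child rows. All source data, templates, and the join condition below are made-up assumptions.

```python
import re

def eval_term_map(template, row):
    """Evaluate a (template-only) term map against a row; illustrative sketch."""
    return re.sub(r"\{([^{}]+)\}", lambda m: str(row[m.group(1)]), template)

ls1 = [{"id": 1, "friendID": 7}]            # <LS1>
ls2 = [{"person": 7, "name": "Alice"}]      # <LS2>

sm1 = "http://example.com/person/{id}"      # <#SM1>
sm2 = "http://example.com/person/{person}"  # <#SM2>, reused via rml:parentTermMap

triples = [
    (eval_term_map(sm1, child), "http://example.com/knows", eval_term_map(sm2, parent))
    for child in ls1
    for parent in ls2
    if child["friendID"] == parent["person"]  # assumed join condition
]

assert triples == [("http://example.com/person/1",
                    "http://example.com/knows",
                    "http://example.com/person/7")]
```

Because the referenced construct is a term map rather than a triples map's subject map, the same mechanism would work for object maps and literal-valued term maps as well.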
