kg-construct / rml-core
RML-Core: Main features for RDF generation with RML
Home Page: https://w3id.org/rml/core/spec
License: Creative Commons Attribution 4.0 International
The issue on joins is closed and a "referencing term maps" spec is mentioned, but where is it? The problem is that the test cases still contain joins and some details need to be worked out.
I get what you mean; it's also reasoning inherited from RML. But if we want to finally be 100% correct, we do need a superclass that lets us model everything correctly. Then we would have something like this:
rml:newSuperTermMap
    rml:TermMap
    rr:TermMap
    fnml:FunctionMap
where rml:TermMap, rr:TermMap and fnml:FunctionMap are subclasses of the newSuperTermMap. That would also make everything more correct, because the Term Map is actually abused in the case of RML: an rr:TermMap should have one of column/template/constant, and not the reference that the rml:TermMap should have.
RML was purposely designed to abuse the R2RML vocabulary, but as it is now, the cleanest solution would be to revise the RML vocabulary so that it correctly generalizes R2RML.
If we make things valid, i.e. as explained above, then I think the best would be to go straight for a correct definition of fnml:FunctionMap (it might even be like the following, i.e. with rr:TermMap a subclass of rml:TermMap, but that might not matter for the discussion here):
rml:newSuperTermMap
    rml:TermMap
        rr:TermMap
    fnml:FunctionMap
Originally posted by @andimou in #11 (comment)
should we generate a triple for empty literals?
if so, when should we do that?
Regarding the decision of which namespace should we use for the RML ontologies (core + modules), there are three options, and they also influence the use of hash or slash:
The R2RML spec states:
If the constant-valued term map is a subject map, predicate map or graph map, then its constant value must be an IRI.
If the constant-valued term map is an object map, then its constant value must be an IRI or literal.
Blank nodes are not part of the options. Why this limitation?
Do we want to maintain this?
In addition to #26, a reminder to add a description of a template which contains multiple references / expressions that each return multiple values.
SHACL shapes in #61 are hard to validate because there are no test cases.
Let's port them from RML testcases to the new specification and make sure they are covering the complete specification.
All rdfs:isDefinedBy statements are using an incorrect ontology IRI.
<http://w3id.org/rml/core/> is used, while the ontology IRI is <http://w3id.org/rml/core>.
See e.g.
rml-core/ontology/documentation/ontology.ttl
Line 113 in 13653c8
while
There are a couple of MUST statements in the spec that depend on the actual data and that could be revisited to support some kind of 'lenient' mode of processing the mapping document.
For example: I have a mapping file with 10 triples maps that map 10 tables, which I use for mapping databases A and B (and many others). 3 of those tables are optional in database B, which crashes the mapping engine because a logical source description MUST resolve to a data source (see e.g. the R2RML spec).
"The referenced columns of all term maps of a triples map (subject map, predicate maps, object maps, graph maps) MUST be column names that exist in the term map's logical table.": don't crash if you don't have all references, but generate all others.
It's an open question whether this is functionality of the mapping engine or a feature of the language; however, I think it would be good to clarify which errors should result in exiting the engine and which should be handled gracefully.
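To make the strict versus lenient distinction concrete, here is a minimal Python sketch. Everything in it is hypothetical: `TriplesMap`, `run_mapping`, the `lenient` flag and `MissingSourceError` are illustration names, not part of RML or of any engine. It only shows the control flow being asked about: in strict mode a missing table aborts the run, in lenient mode the affected triples map is skipped and reported.

```python
from dataclasses import dataclass, field


class MissingSourceError(Exception):
    """Raised in strict mode when a logical source cannot be resolved."""


@dataclass
class TriplesMap:
    name: str
    source: str  # hypothetical: name of the table this triples map reads


@dataclass
class Report:
    processed: list = field(default_factory=list)
    skipped: list = field(default_factory=list)


def run_mapping(triples_maps, available_tables, lenient=False):
    """Process every triples map; in lenient mode skip those whose source is missing."""
    report = Report()
    for tm in triples_maps:
        if tm.source not in available_tables:
            if not lenient:
                raise MissingSourceError(f"{tm.name}: table {tm.source!r} not found")
            report.skipped.append(tm.name)
            continue
        report.processed.append(tm.name)  # real triple generation would happen here
    return report


# 10 triples maps over 10 tables; "database B" lacks the last 3 tables.
maps = [TriplesMap(f"TM{i}", f"table{i}") for i in range(1, 11)]
database_b = {f"table{i}" for i in range(1, 8)}
report = run_mapping(maps, database_b, lenient=True)
print(report.skipped)  # ['TM8', 'TM9', 'TM10']
```

Whether such a switch belongs in the language or in engine configuration is exactly the open question above; the sketch takes no position on that.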
Many comment on RML as being verbose. RML mappings do not necessarily need to be crafted by hand, and multiple DSL-like languages have emerged, such as ShExML and YARRRML.
I propose to standardize (after RML is 'complete') a human-friendly way to edit RML documents, as these previous languages have shown that this will increase uptake of the spec.
Similar efforts: SHACL compact syntax
I see that we currently have the logical source cardinality to be equal to 1,
i.e. each TriplesMap should have exactly 1 Logical Source.
this holds in v1 of the shapes and the current version of the spec.
Question is: do we keep this? Should it always be 1? Should it be at least 1?
An aspect missing from RML is the ability to generate a triple using an inverse predicate.
Specifying the generation of triples with an inverse predicate would allow you to generate a triple in the inverse direction, so object -> predicate -> subject, in the context of the subject-generating triples map.
Note: YARRRML also introduces inverse predicates.
Proposal:
rml:InversePredicateMap, which is a term map that can be referenced from a rml:PredicateObjectMap using rml:inversePredicateMap.
rml:inversePredicate for rml:inversePredicateMap, analogous to rr:predicate.
rml:PredicateObjectMap such that it should have at least one rml:PredicateMap or rml:InversePredicateMap. This will allow for maximum flexibility in use.
rml:PredicateObjectMap which also references an rml:InversePredicateMap to IRIs and blank nodes.
Example:
<someTriplesMap>
rml:logicalSource [] ;
rr:subject ex:subject ;
rr:predicateObjectMap [
rr:predicateMap [
rr:constant ex:parent ;
] ;
rml:inversePredicateMap [
rr:constant ex:child ;
] ;
rr:object ex:object ;
] .
or using shortcut properties:
<someTriplesMap>
rml:logicalSource [] ;
rr:subject ex:subject ;
rr:predicateObjectMap [
rr:predicate ex:parent ;
rml:inversePredicate ex:child;
rr:object ex:object ;
] .
would generate:
ex:subject ex:parent ex:object .
ex:object ex:child ex:subject .
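The intended expansion can be sketched in a few lines of Python. This is only an illustration of the proposal, not an implementation from any engine: the function name `expand_predicate_object_map` and the `object_is_literal` flag are assumptions made for the sketch. The flag reflects the point in the proposal that objects should be limited to IRIs and blank nodes when an inverse predicate is present, since a literal cannot become a subject.

```python
def expand_predicate_object_map(subject, obj, predicates=(), inverse_predicates=(),
                                object_is_literal=False):
    """Return the triples for one subject/object pair.

    Regular predicates yield (subject, p, object); inverse predicates yield
    (object, p, subject).
    """
    if inverse_predicates and object_is_literal:
        raise ValueError("an inverse predicate needs an IRI or blank node object")
    triples = [(subject, p, obj) for p in predicates]
    triples += [(obj, p, subject) for p in inverse_predicates]
    return triples


triples = expand_predicate_object_map(
    "ex:subject", "ex:object",
    predicates=["ex:parent"],
    inverse_predicates=["ex:child"],
)
for s, p, o in triples:
    print(s, p, o, ".")
```

Running it prints the same two triples as the expected output above.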
is the description/definition of the base IRI sufficient as it comes from the R2RML spec or do we need to further clarify certain aspects of it?
GitHub Actions seems to be misbehaving; let's fix that.
Currently, there is no way to define windowing semantics in RML.
Windowing is crucial when evaluating joins between different live streaming
data sources.
Furthermore, windowing could also support buffering capabilities for
aggregation functions when processing streaming data sources. For example,
calculating an average of the values over the last 5 minutes.
According to Gedik B.,
a window's behaviour is defined by its type and its policies.
There are 2 main types of windows: tumbling and sliding windows.
An illustration about how these windows work can be found here.
Note: A session window is a special case of a tumbling window where the window
only gets dropped when the inactivity threshold is violated.
The policies control when the window evicts the tuples inside
it (eviction policy), and when it triggers the processing of the
tuples using the operator logic defined inside the window (trigger policy).
Policies are further divided into 4 categories namely:
Thus, we need a set of vocabulary to define and configure windows by
describing:
The true semantics and combination of the policies are further explained by
Gedik B.
Given the following RML with a join condition:
<#TM1>
rml:logicalSource <#STREAM1> ;
rml:subjectMap <#SM1> ;
rml:predicateObjectMap [
rml:predicateMap <#PM1> ;
rml:objectMap [
rml:parentTriplesMap <#TM2>;
rr:joinCondition [
rr:child "id";
rr:parent "p_id";
];
];
].
<#TM2>
rml:logicalSource <#STREAM2> ;
rml:subjectMap <#SM2> ;
rml:predicateObjectMap [
rml:predicateMap <#PM2> ;
rml:objectMap <#OM2> ] .
Windows could be defined in the object map
<#TM1>
rml:logicalSource <#STREAM1> ;
rml:subjectMap <#SM1> ;
rml:predicateObjectMap [
rml:predicateMap <#PM1> ;
rml:objectMap [
# Define the window to be used for joining
rml:window [
# Define window types
rml:windowType rml:Tumbling;
# Define the trigger policy for the window
# Every 5th record will execute the join
rml:trigger [ a rml:CountPolicy;
rml:countValue 5;
];
# Define the eviction policy for the window
# Clean up window after processing the 15th record
rml:evict [ a rml:CountPolicy;
rml:countValue 15;
];
];
rml:parentTriplesMap <#TM2>;
rr:joinCondition [
rr:child "id";
rr:parent "p_id";
];
];
].
<#TM2>
rml:logicalSource <#STREAM2> ;
rml:subjectMap <#SM2> ;
rml:predicateObjectMap [
rml:predicateMap <#PM2> ;
rml:objectMap <#OM2> ] .
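The count-based policies in the mapping above can be read as follows; this Python sketch is one possible interpretation, not a definition of the proposed vocabulary. The class `CountWindow` and its parameters are hypothetical names. It assumes the trigger policy fires on every 5th buffered record and the eviction policy empties the window once 15 records have been buffered.

```python
class CountWindow:
    """Count-based window: fires every `trigger_every` records, empties after `evict_after`."""

    def __init__(self, trigger_every, evict_after):
        self.trigger_every = trigger_every
        self.evict_after = evict_after
        self.buffer = []

    def add(self, record):
        """Buffer a record; return a snapshot when the trigger policy fires, else None."""
        self.buffer.append(record)
        snapshot = None
        if len(self.buffer) % self.trigger_every == 0:
            snapshot = list(self.buffer)  # contents handed to the join operator
        if len(self.buffer) >= self.evict_after:
            self.buffer.clear()  # eviction policy: drop everything in the window
        return snapshot


window = CountWindow(trigger_every=5, evict_after=15)
fired_sizes = []
for record_id in range(1, 31):
    snapshot = window.add({"id": record_id})
    if snapshot is not None:
        fired_sizes.append(len(snapshot))
print(fired_sizes)  # [5, 10, 15, 5, 10, 15]
```

In a real join, each snapshot would be matched against the parent stream's window using the join condition (`id` = `p_id`); that part is left out to keep the policy behaviour visible.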
The Data Errors subsection of [R2]RML core specification might need some updating, perhaps also considering revised test cases.
Do we already have any specifications in the case of having a list of output values? I think it falls more into the behavior of the engine but still makes sense to have the definition of the correct behavior.
Originally posted by @samiscoding in https://github.com/kg-construct/rml-fno-spec/issues/7#issuecomment-810846015
how do we handle multiple values in RML?
Related to https://github.com/kg-construct/mapping-challenges/tree/main/challenges/multivalue-references
The R2RML specification names rr:parent and rr:child as parent query and child query; would we keep this terminology?
Or should we skip the query part and keep them only parent and child? I have the impression when we talk we typically use these terms.
or something else? if so, what?
Having two sources (A and B) that can be joined by a field, would it be possible to apply functions on some B fields and use them in any part of the TriplesMap from A?
Let me give an example.
Table A:
AC1 | AC2 |
---|---|
a | a1 |
b | b1 |
Table B
BC1 | BC2 |
---|---|
a1 | "hello a" |
b1 | "hello b" |
Output (applying uppercase to BC2)
<http://example.org/a> ex:predicate "HELLO A"^^xsd:string .
<http://example.org/b> ex:predicate "HELLO B"^^xsd:string .
I don't know whether it is currently possible to declare the join condition in the mapping rules, or whether we would have to create an ad-hoc function.
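The desired behaviour, independent of how it would be expressed in RML, can be sketched in Python. The helper `join_and_transform` is a hypothetical name, and the sketch assumes the join is A.AC2 = B.BC1, the subject is built from AC1, and the function (here uppercase) is applied to BC2 of the joined row.

```python
table_a = [{"AC1": "a", "AC2": "a1"}, {"AC1": "b", "AC2": "b1"}]
table_b = [{"BC1": "a1", "BC2": "hello a"}, {"BC1": "b1", "BC2": "hello b"}]


def join_and_transform(child_rows, parent_rows, child_key, parent_key, func, value_key):
    """Equi-join child and parent rows, then apply `func` to one parent field."""
    index = {}
    for row in parent_rows:
        index.setdefault(row[parent_key], []).append(row)
    for child in child_rows:
        for parent in index.get(child[child_key], []):
            yield child, func(parent[value_key])


triples = [
    (f"http://example.org/{child['AC1']}", "ex:predicate", value)
    for child, value in join_and_transform(table_a, table_b, "AC2", "BC1", str.upper, "BC2")
]
for triple in triples:
    print(triple)
```

The point of the sketch is the order of operations: the join happens first, and the function then sees values from the joined (parent) record rather than from the child record.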
IDs does not exist in the source and the mapping should not generate an output
5*8ALPHA (src: https://www.rfc-editor.org/rfc/rfc5646). I would not vouch for using a list of valid tags, but we may restrict the spec by explicitly referring to the first language rule: language = 2*3ALPHA ["-" extlang]
http://example.com/base/path/../Danny is not an absolute IRI.
"\\{\\{\\{ {$.['ISO 3166']} \\}\\}\\}"
Apparently the RDF specification allows it
Section 4.1, Identifying collections, states:
If no rr:template or rml:reference is provided for generating blank node IDs (rr:BlankNode) or IRIs (rr:IRI), then each iteration generates a new blank node identifier for the collection or container.
I think there is a general issue that we need to decide on here, which is: do we allow the generation of blank nodes without an explicit expression (rr:template, rml:reference, or other)?
Generating a random blank node ID has an impact on how you can implement joins on those terms, because there is no way to make the ID generation repeatable. The R2RML spec does not allow blank node generation without an rr:template or rr:column, I assume for this reason.
So if we would want to allow this we would either have to:
This is a follow up of kg-construct/rml-cc#19.
It's been agreed to create class rml:Strategy in rml-core, but keep the definition of instances rml:Append and rml:CartesianProduct in the collection-containers-spec.
I believe that checking the validity of templates should be included in the shapes. I'm not sure whether SPARQL's regular expressions allow for recursion, but a check for balanced, non-nested braces can be achieved with:
^[^\{\}]*(?:\{[^\{\}]+\}[^\{\}]*)*$
This can be implemented as a SPARQL constraint component.
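As a quick way to see what the expression accepts and rejects, here is a small Python check that uses the regular expression above verbatim. The function name `is_valid_template` is an assumption for this sketch. Note one limitation: the pattern does not account for backslash-escaped braces, which R2RML templates allow, so templates using `\{` or `\}` would be rejected by it.

```python
import re

# The expression from the text, unchanged: literal text, then zero or more
# "{reference}" groups, each followed by more literal text.
TEMPLATE_PATTERN = re.compile(r"^[^\{\}]*(?:\{[^\{\}]+\}[^\{\}]*)*$")


def is_valid_template(template: str) -> bool:
    """True when every brace pair is balanced, non-empty and not nested."""
    return TEMPLATE_PATTERN.match(template) is not None


for candidate in [
    "http://example.com/{id}",
    "http://example.com/{a}{b}",
    "http://example.com/{a{b}}",
    "http://example.com/{unclosed",
    "http://example.com/{}",
]:
    print(candidate, is_valid_template(candidate))
```

The first two candidates pass; the nested, unclosed and empty-brace candidates fail.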
Just to summarize the current discussions and make sure I understand: the problem is that if a function is defined to return multiple outputs, FNML is ambiguous in which output to use.
Right now, implicitly, we assume to always take the first output. At the very least, this should be explicitly stated in the specification.
HOWEVER, that still gives problems if at some point you want to use the second output of a function :).
So we need a way to specify the output of the function, and Sam suggested using the SubjectMap for this.
(If above doesn't reflect the correct reasoning, please correct me and ignore my suggestions below ;) )
I personally think that there are the following options for specifying the output of the function:
subjectmap within a FunctionTripleMaps has a quite different definition than subjectmaps within regular TripleMaps, namely something like "The subjectMap of a FunctionTriplesMap generates the reference to the needed output of the function". I'm not for this option, since it's basically a redefinition and also hinders the generation of provenance data in the long run.
OutputMap within a FunctionTripleMaps with the definition "The outputMap of a FunctionTriplesMap generates the reference to the needed output of the function". The subjectMap's definition is untouched and can be used for provenance generation, and you can specify the output of the function.
outputMap on the level of FunctionTermMap instead of FunctionTriplesMap (so you, e.g., use multiple outputs of the same FunctionTriplesMap for different (regular) TermMaps).
I'm in favor of this last option; see the example below for what this entails.
# Function description #
ex:parseName a fno:Function ;
fno:expects ( [ fno:predicate ex:inputString ] ) ;
fno:returns ( [ fno:predicate ex:firstName ] [ fno:predicate ex:lastName ] ) .
# Mapping #
<#Person_Mapping>
rml:logicalSource <#LogicalSource> ; # Specify the data source
rr:subjectMap <#SubjectMap> ; # Specify the subject
rr:predicateObjectMap <#FirstNameMapping> ; # Specify the predicate-object-map
rr:predicateObjectMap <#LastNameMapping> . # Specify the predicate-object-map
<#FirstNameMapping>
rr:predicate foaf:firstName ; # Specify the predicate
rr:objectMap <#FunctionTermMapFirstName> . # Specify the object-map
<#LastNameMapping>
rr:predicate foaf:lastName ; # Specify the predicate
rr:objectMap <#FunctionTermMapLastName> . # Specify the object-map
<#FunctionTermMapFirstName>
fnml:functionValue <#parseNameFunctionTriplesMap> ;
fnml:outputValue ex:firstName .
<#FunctionTermMapLastName>
fnml:functionValue <#parseNameFunctionTriplesMap> ;
fnml:outputValue ex:lastName .
<#parseNameFunctionTriplesMap>
a fnml:FunctionTriplesMap ;
rr:predicateObjectMap [
rr:predicate fno:executes ; # Execute the function
rr:objectMap [ rr:constant ex:parseName ] # ex:parseName
] ;
rr:predicateObjectMap [
rr:predicate ex:inputString ;
rr:objectMap [ rml:reference "name" ] # Use as input the "name" reference
] .
# When given the reference "name", e.g. value "Ben De Meester", this functiontriplesMap will generate following triples:
_:a # Blank node, because no SubjectMap is given
fno:executes ex:parseName ;
ex:inputString "Ben De Meester" .
# After execution, following triples will be generated
_:a # Same blank node
ex:firstName "Ben" ;
ex:lastName "De Meester" .
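The selection step that the proposed fnml:outputValue adds can be sketched in Python. This is an illustration under assumptions, not FNML semantics: `parse_name` is a toy stand-in for ex:parseName that splits on the first space, and `evaluate_function_term_map` is a hypothetical helper that picks one named output from a function returning several.

```python
def parse_name(input_string):
    """Toy stand-in for ex:parseName: split on the first space."""
    first, _, last = input_string.partition(" ")
    return {"ex:firstName": first, "ex:lastName": last}


def evaluate_function_term_map(function, arguments, output_value):
    """Execute the function once and return only the requested output."""
    outputs = function(**arguments)
    if output_value not in outputs:
        raise KeyError(f"function has no output {output_value!r}")
    return outputs[output_value]


record = {"name": "Ben De Meester"}
first_name = evaluate_function_term_map(parse_name, {"input_string": record["name"]}, "ex:firstName")
last_name = evaluate_function_term_map(parse_name, {"input_string": record["name"]}, "ex:lastName")
print(first_name, "|", last_name)  # Ben | De Meester
```

Keying the outputs by predicate, rather than by position, is what avoids the "always take the first output" assumption discussed above.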
Some namespaces that are used in RML configuration files do not dereference. It is therefore not possible to obtain an RDF representation of these vocabularies.
IRI dereference is useful, since this allows vocabularies to be pulled into any standard-compliant environment, using a simple HTTP request.
The following cURL request reproduces the dereference that is performed by TriplyDB, but this should be very similar to how any other standards-conforming linked data client sends such requests:
curl -vL -H 'Accept: application/trig, application/n-quads, application/n-triples;q=0.9, text/turtle;q=0.9, application/x-turtle;q=0.9, text/rdf+n3;q=0.9, application/rdf+xml;q=0.8, text/plain;q=0.8, */*;q=0.7' 'http://semweb.mmlab.be/ns/ql#'
* Trying 193.191.148.200:80...
* Connected to semweb.mmlab.be (193.191.148.200) port 80
> GET /ns/ql HTTP/1.1
> Host: semweb.mmlab.be
> Accept: application/trig, application/n-quads, application/n-triples;q=0.9, text/turtle;q=0.9, application/x-turtle;q=0.9, text/rdf+n3;q=0.9, application/rdf+xml;q=0.8, text/plain;q=0.8, */*;q=0.7
>
< HTTP/1.1 404 Not Found
< Server: nginx/1.14.0 (Ubuntu)
< Date: Sun, 11 Feb 2024 08:55:31 GMT
< Content-Type: text/html; charset=utf-8
< Content-Length: 12
< Connection: keep-alive
< X-Powered-By: Express
< ETag: "703595115"
Notice that the 'ql' vocabulary does not exist.
The same request against the R2RML namespace:
curl -vL -H 'Accept: application/trig, application/n-quads, application/n-triples;q=0.9, text/turtle;q=0.9, application/x-turtle;q=0.9, text/rdf+n3;q=0.9, application/rdf+xml;q=0.8, text/plain;q=0.8, */*;q=0.7' 'http://www.w3.org/ns/r2rml#' > aap
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* Trying [2606:4700::6812:1613]:80...
* Connected to www.w3.org (2606:4700::6812:1613) port 80
> GET /ns/r2rml HTTP/1.1
> Host: www.w3.org
> User-Agent: curl/8.4.0
> Accept: application/trig, application/n-quads, application/n-triples;q=0.9, text/turtle;q=0.9, application/x-turtle;q=0.9, text/rdf+n3;q=0.9, application/rdf+xml;q=0.8, text/plain;q=0.8, */*;q=0.7
>
< HTTP/1.1 301 Moved Permanently
< Date: Sun, 11 Feb 2024 09:05:08 GMT
< Transfer-Encoding: chunked
< Connection: keep-alive
< Cache-Control: max-age=3600
< Expires: Sun, 11 Feb 2024 10:05:08 GMT
< Location: https://www.w3.org/ns/r2rml
< Set-Cookie: __cf_bm=jBbWIn71PDCr7f80XLmWc0dTMUnSLHJwXOt9OWTrpKc-1707642308-1-AetC7y7UHMuoI4vdIMnsELUEU6fAEyQalSKFTSyBD4x4rsb61a8khjk+oPEBlmnXo79h7d6zSAwZdHXwomNgAW4=; path=/; expires=Sun, 11-Feb-24 09:35:08 GMT; domain=.w3.org; HttpOnly; SameSite=None
< Server: cloudflare
< CF-RAY: 853b6da918b40e40-AMS
< alt-svc: h3=":443"; ma=86400
<
* Ignoring the response-body
{ [5 bytes data]
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
* Connection #0 to host www.w3.org left intact
* Clear auth, redirects to port from 80 to 443
* Issue another request to this URL: 'https://www.w3.org/ns/r2rml'
* Trying [2606:4700::6812:1613]:443...
* Connected to www.w3.org (2606:4700::6812:1613) port 443
* schannel: disabled automatic use of client certificate
* ALPN: curl offers http/1.1
* ALPN: server accepted http/1.1
* using HTTP/1.1
> GET /ns/r2rml HTTP/1.1
> Host: www.w3.org
> User-Agent: curl/8.4.0
> Accept: application/trig, application/n-quads, application/n-triples;q=0.9, text/turtle;q=0.9, application/x-turtle;q=0.9, text/rdf+n3;q=0.9, application/rdf+xml;q=0.8, text/plain;q=0.8, */*;q=0.7
>
* schannel: remote party requests renegotiation
* schannel: renegotiating SSL/TLS connection
* schannel: SSL/TLS connection renegotiated
* schannel: failed to decrypt data, need more data
< HTTP/1.1 200 OK
< Date: Sun, 11 Feb 2024 09:05:08 GMT
< Content-Type: text/html; charset=utf-8
< Transfer-Encoding: chunked
< Connection: keep-alive
< content-location: r2rml.html
< vary: negotiate,accept,Accept-Encoding
< tcn: choice
< last-modified: Mon, 17 Sep 2012 15:21:58 GMT
< etag: W/"1818d-4c9e7559fb180;a0-4939a0734f380
< Cache-Control: max-age=21600
< expires: Sun, 11 Feb 2024 15:05:08 GMT
< x-backend: www-mirrors
< x-request-id: 853b6da9bc5d6572
< strict-transport-security: max-age=15552000; includeSubdomains; preload
< content-security-policy: frame-ancestors 'self' https://cms.w3.org/; upgrade-insecure-requests
< CF-Cache-Status: BYPASS
< Set-Cookie: __cf_bm=7cDNNh9LZo1Y8n8aNwGfBd8DVwH3YTA9o5ef8kMnAXs-1707642308-1-ATwrFtMqqk6DhOQy6Oc0oBj4wSERNIUTH6h+x4xfuHHtbtg52f2QT6kkJ8dRknK0TXoyaik1f7/Vg6N87o1hMaE=; path=/; expires=Sun, 11-Feb-24 09:35:08 GMT; domain=.w3.org; HttpOnly; Secure; SameSite=None
< Server: cloudflare
< CF-RAY: 853b6da9bc5d6572-AMS
< alt-svc: h3=":443"; ma=86400
<
{ [11336 bytes data]
100 98701 0 98701 0 0 259k 0 --:--:-- --:--:-- --:--:-- 0
* Connection #1 to host www.w3.org left intact
Notice that the 'rr'/'r2rml' vocabulary exists, but is not available in an RDF serialization format.
curl -vL -H 'Accept: application/trig, application/n-quads, application/n-triples;q=0.9, text/turtle;q=0.9, application/x-turtle;q=0.9, text/rdf+n3;q=0.9, application/rdf+xml;q=0.8, text/plain;q=0.8, */*;q=0.7' 'http://www.w3.org/ns/csvw#' > aap
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* Trying [2606:4700::6812:1713]:80...
* Connected to www.w3.org (2606:4700::6812:1713) port 80
> GET /ns/csvw HTTP/1.1
> Host: www.w3.org
> User-Agent: curl/8.4.0
> Accept: application/trig, application/n-quads, application/n-triples;q=0.9, text/turtle;q=0.9, application/x-turtle;q=0.9, text/rdf+n3;q=0.9, application/rdf+xml;q=0.8, text/plain;q=0.8, */*;q=0.7
>
< HTTP/1.1 301 Moved Permanently
< Date: Sun, 11 Feb 2024 09:32:16 GMT
< Transfer-Encoding: chunked
< Connection: keep-alive
< Cache-Control: max-age=3600
< Expires: Sun, 11 Feb 2024 10:32:16 GMT
< Location: https://www.w3.org/ns/csvw
< Set-Cookie: __cf_bm=iOe70axn1ua.4ohv_Y.cH9yRby0WSFMGAjmzgJyrYKU-1707643936-1-AcNjXt1N40OIS6F0aOVweENzSjT8ag0qSDRRNJusBhq5DHAXg0rRJGOInPYLu45zM7SjwJI50Kqq6cuhwZXN2P8=; path=/; expires=Sun, 11-Feb-24 10:02:16 GMT; domain=.w3.org; HttpOnly; SameSite=None
< Server: cloudflare
< CF-RAY: 853b956df9e466f7-AMS
< alt-svc: h3=":443"; ma=86400
<
* Ignoring the response-body
{ [5 bytes data]
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
* Connection #0 to host www.w3.org left intact
* Clear auth, redirects to port from 80 to 443
* Issue another request to this URL: 'https://www.w3.org/ns/csvw'
* Trying [2606:4700::6812:1713]:443...
* Connected to www.w3.org (2606:4700::6812:1713) port 443
* schannel: disabled automatic use of client certificate
* ALPN: curl offers http/1.1
* ALPN: server accepted http/1.1
* using HTTP/1.1
> GET /ns/csvw HTTP/1.1
> Host: www.w3.org
> User-Agent: curl/8.4.0
> Accept: application/trig, application/n-quads, application/n-triples;q=0.9, text/turtle;q=0.9, application/x-turtle;q=0.9, text/rdf+n3;q=0.9, application/rdf+xml;q=0.8, text/plain;q=0.8, */*;q=0.7
>
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* schannel: remote party requests renegotiation
* schannel: renegotiating SSL/TLS connection
* schannel: SSL/TLS connection renegotiated
* schannel: failed to decrypt data, need more data
< HTTP/1.1 200 OK
< Date: Sun, 11 Feb 2024 09:32:17 GMT
< Content-Type: text/html; charset=utf-8
< Transfer-Encoding: chunked
< Connection: keep-alive
< content-location: csvw.html
< vary: negotiate,accept,Accept-Encoding
< tcn: choice
< last-modified: Mon, 08 Oct 2018 10:13:20 GMT
< etag: W/"18ee4-577b4ded6f000;9f-50ab8a8466840
< Cache-Control: max-age=21600
< expires: Sun, 11 Feb 2024 15:32:17 GMT
< x-backend: www-mirrors
< x-request-id: 853b956eaa726621
< strict-transport-security: max-age=15552000; includeSubdomains; preload
< content-security-policy: frame-ancestors 'self' https://cms.w3.org/; upgrade-insecure-requests
< CF-Cache-Status: BYPASS
< Set-Cookie: __cf_bm=WEw9_l4uyxFUnuconnVFa9rHaXt6lr61F5IPvGJKtxg-1707643937-1-AUsPiDkr+ZoFfbXbGRap+pX5GEIGjZcJoC6bNoGr5ornXc+l7FchvMzFwb74Iu1lN0rmoUa9v5Fl9ZY3vGIO/pE=; path=/; expires=Sun, 11-Feb-24 10:02:17 GMT; domain=.w3.org; HttpOnly; Secure; SameSite=None
< Server: cloudflare
< CF-RAY: 853b956eaa726621-AMS
< alt-svc: h3=":443"; ma=86400
<
{ [25027 bytes data]
100 99k 0 99k 0 0 210k 0 --:--:-- --:--:-- --:--:-- 727k
* Connection #1 to host www.w3.org left intact
*
Notice that 'csvw' only exists in HTML (but not in any RDF format).
Expected: all vocabularies that are commonly used in RML configuration files should be available through IRI dereferencing.
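The three outcomes observed above can be summarized with a small Python classifier. It does not perform any HTTP request; it only interprets a final status code and Content-Type header such as the ones shown in the logs. The function name `classify_dereference` and the three result labels are assumptions for this sketch; the media types are the RDF ones from the Accept header used in the cURL commands.

```python
# RDF media types taken from the Accept header in the requests above
# (text/plain and */* are deliberately left out: they are not RDF-specific).
RDF_MEDIA_TYPES = {
    "application/trig",
    "application/n-quads",
    "application/n-triples",
    "text/turtle",
    "application/x-turtle",
    "text/rdf+n3",
    "application/rdf+xml",
}


def classify_dereference(status, content_type):
    """Classify the final response of a dereference attempt."""
    if status >= 400:
        return "missing"
    media_type = content_type.split(";")[0].strip().lower()
    return "rdf" if media_type in RDF_MEDIA_TYPES else "non-rdf"


# Final responses as they appear in the logs above.
observed = {
    "ql": (404, "text/html; charset=utf-8"),
    "r2rml": (200, "text/html; charset=utf-8"),
    "csvw": (200, "text/html; charset=utf-8"),
}
results = {name: classify_dereference(*response) for name, response in observed.items()}
print(results)
```

The expected behaviour stated above corresponds to every vocabulary being classified as "rdf".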
Eventually we need to decide on our IRI strategy for the different modules of the specifications.
Any suggestions?
The core description has references to some input data but if we split the logical source and core part of the spec, how should we handle the references of the examples at the core part?
Should we introduce an abstract running example? or should we still have examples for all exemplary data sources?
Let's say we have two triple maps that refer to the same logical source (and with same, we really mean same URI, not "same because the descriptions lead to the semantically same logical source").
Sample source (CSV)
id,parent_id
1,2
2,1
Base mapping (YARRRML)
prefixes:
  ex: http://example.com#
sources:
  test: [data.csv]
mappings:
  test1:
    s: ex:$(id)
    po:
      p: ex:parent
      o:
        mapping: test2
  test2:
    s: ex:$(parent_id)
We have the following use cases that are underspecified in the spec.
The spec currently says: "If the logical source of the child triples map and the logical source of the parent triples map of a referencing object map are not identical, then the referencing object map must have at least one join condition."
ex:1 ex:parent ex:2, ex:1 ex:parent ex:1, ex:2 ex:parent ex:2, ex:2 ex:parent ex:1
ex:1 ex:parent ex:2, ex:2 ex:parent ex:1
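The two result sets above correspond to two readings of a referencing object map without a join condition. The following Python sketch reproduces both from the sample CSV; the function names `cartesian_product` and `same_iteration` are labels chosen for this illustration, not terms from the spec.

```python
rows = [{"id": "1", "parent_id": "2"}, {"id": "2", "parent_id": "1"}]


def cartesian_product(rows):
    """Every child iteration is combined with every parent iteration."""
    return [
        (f"ex:{child['id']}", "ex:parent", f"ex:{parent['parent_id']}")
        for child in rows
        for parent in rows
    ]


def same_iteration(rows):
    """Child and parent are evaluated on the same iteration of the shared source."""
    return [(f"ex:{row['id']}", "ex:parent", f"ex:{row['parent_id']}") for row in rows]


print(len(cartesian_product(rows)), "triples without join, cartesian reading")
print(len(same_iteration(rows)), "triples without join, same-iteration reading")
```

The first reading yields the four triples listed above, the second the two; which one the spec intends when both triples maps share the same logical source is the underspecified part.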
The current use case for a LogicalSource definition on a FunctionTriplesMap seems to be:
The ability to generate values from a different source and use these values as the result of a Function Term Map.
An example of this is included in one of the proposed FnO test cases: RMLFNOTC009
However, since a FunctionTriplesMap doesn't generate values directly, but generates intermediate function execution triples expressed in FnO, the question arises of how to handle joins between a TriplesMap and a FunctionTriplesMap with a different LogicalSource.
As this is not the same type of join as a join on a RefObjectMap, this join would have to be defined. Subsequently, this would require another specific type of join to be implemented by engines.
At the same time we have a very similar mapping challenge for generating literal values by joining different logical sources: the join-on-literal challenge.
I believe it would be advantageous to come up with a solution that covers both generating literals from different LogicalSources using joins and generating function values from different LogicalSources.
As this solution would not be specific to functions, I think we should look for a solution in the definition of LogicalSources. (pinging @thomas-delva)
There are several considerations that can be made when implementing RML+FnO. Some of these might need to be specified in their own section; others might be too specific to certain implementations, but might still be interesting to mention. For now we can add a section 'Implementation considerations' to collect these, as discussed in this slack thread
Examples of issues:
After our discussions, I created an Excel sheet where we can include all possible combinations.
The excel sheet is available at https://docs.google.com/spreadsheets/d/163K5XoeudPyvEWzlc9oTRqPwHCwGHUopu4mJMGPndR0/edit
I create this issue to provide the link and use it as reference for other issues that refer to this Excel sheet but discuss more detailed topics
We need to agree on what we are going to call the rml:reference values.
Since we are adopting solution (3) from #56, we need to decide the namespace for each module. Here is a first proposal:
Any suggestion of change of any module is welcome
@DylanVanAssche and @pmaria should already have some shapes, let's start with these, what they cover and what they don't and how they need to be adjusted to fit to the new spec of RML.
Proposal sketched up with @DylanVanAssche for CI/CD workflows for the spec repositories, to keep specs and ontology/shapes in line, and also do some automatic validations / generation where applicable.
Create a GitHub action, to be used by all spec repos, that contains generic CI/CD functionality.
As input to the action, zero or more other spec repositories on which this spec repo depends can be specified.
The action then collects the necessary artefacts from the repo's and executes the following steps:
All spec repo's get the following directory structure
[spec-repo]
├── model/
│   ├── ontology/   # Ontology reflecting the spec ( this will be combined with rml-core and possibly other specs )
│   └── shapes/     # Shapes reflecting the spec
│       └── tests/  # Test cases which cover the shapes
└── examples/       # Examples used in the spec
Under model/shapes/tests we have test cases that cover the shapes reflecting the spec developed in the current repo.
This helps us ensure that our shapes and ontology are valid and stay up to date with the spec.
Under examples/ we have all examples that are used in the text of the spec. We place them in this standard location so that they can also be validated against the shapes used in the previous step.
On each push to any branch, these tests will be run.
We will introduce PR templates including a checklist which reminds us to:
We can introduce a merge action which triggers a commit or PR to update the full model (in rml-core repo?)
R2RML defines a safe separator that should be used when a template contains more than one value/column reference. This issue discusses the use case for this, how null values in value references in templates should be handled, and tries to determine how current (R2)RML processors handle this.
In the definition of templates in R2RML the following is stated:
If a template contains multiple pairs of unescaped curly braces, then any pair SHOULD be separated from the next one by a safe separator. This is any character or string that does not occur anywhere in any of the data values of either referenced column; or in the IRI-safe versions of the data values, if the term type is rr:IRI (see note below).
A few things of note here:
First off, the usage of the keyword SHOULD instead of MUST. So engines can deviate from this for valid reasons in particular circumstances? What are those reasons?
So a template like "http://example.com/{}{}" SHOULD actually not be used/allowed.
How does one know upfront what "character or string that does not occur anywhere in any of the data values of either referenced column; or in the IRI-safe versions of the data values"? Or should this be validated upon evaluating the template?
This doesn't seem to be covered in the R2RML test cases, other than that all templates with multiple pairs of unescaped curly braces do seem to have a separator.
What happens when one of the references in a template with multiple pairs of unescaped curly braces is NULL? The specified algorithm is actually not clear on this:
1. Let result be the template string
2. For each pair of unescaped curly braces in result:
   1. Let value be the data value of the column whose name is enclosed in the curly braces
   2. If value is NULL, then return NULL
   3. Let value be the natural RDF lexical form corresponding to value
   4. If the term type is rr:IRI, then replace the pair of curly braces with an IRI-safe version of value; otherwise, replace the pair of curly braces with value
3. Return result
Does returning null at 2.2 mean returning null for evaluation of the entire template? Or just for that pair of curly braces? And if the latter, why mention this at all in the algorithm?
However if the former, why would there be a need for a safe separator?
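For reference, here is a Python sketch of the quoted algorithm under the first reading, where a NULL in any reference makes the whole template evaluate to NULL. It is a simplification with stated assumptions: `expand_template` is a hypothetical name, escaped braces are not handled, `str(value)` stands in for the natural RDF lexical form, and `urllib.parse.quote` approximates the IRI-safe version (it also percent-encodes non-ASCII characters, which R2RML does not require).

```python
import re
from urllib.parse import quote

# Matches one "{reference}" pair; escaped braces are not handled in this sketch.
PLACEHOLDER = re.compile(r"\{([^{}]+)\}")


def expand_template(template, row, term_type="IRI"):
    """Return the expanded template, or None as soon as a referenced value is NULL."""
    parts = []
    position = 0
    for match in PLACEHOLDER.finditer(template):
        value = row.get(match.group(1))
        if value is None:
            return None  # step 2.2, read as: the whole template yields NULL
        text = str(value)  # stand-in for the natural RDF lexical form
        if term_type == "IRI":
            text = quote(text, safe="")  # approximation of the IRI-safe version
        parts.append(template[position:match.start()])
        parts.append(text)
        position = match.end()
    parts.append(template[position:])
    return "".join(parts)


print(expand_template("http://example.com/{first}-{last}", {"first": "Ada", "last": "King Noel"}))
print(expand_template("http://example.com/{first}-{last}", {"first": "Ada", "last": None}))
```

Under this reading a safe separator never has to disambiguate a missing value, which is the tension raised in the question above.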
The only use case I can imagine for a safe separator is to not get clashes when a referenced value in a template with multiple referenced values is empty.
Given
A | B |
---|---|
A | ~ |
A~ | NULL |
, with a template without a safe separator "http://example.com/{A}{B}" and not returning null for the template when one of its referenced values is null, this would result in
http://example.com/A~
http://example.com/A~
, a (undesired??) clash.
With a template with a safe separator "http://example.com/{A}-{B}" and not returning null for the template when one of its referenced values is null, this would result in
http://example.com/A-~
http://example.com/A~-
, no clash.
Of course with or without a safe separator, if a template should be evaluated to null if one of its referenced values is null, in both cases the result for "http://example.com/{A}{B}"
would be:
http://example.com/A~
However, safe separators don't actually solve clashes when the same value reference can contain empty strings and NULLs. Given:
A | B |
---|---|
A | ~ |
A~ | NULL |
A~ | (empty string) |
In the first case you'd get all clashes:
http://example.com/A~
http://example.com/A~
http://example.com/A~
In the second case you'd get one clash:
http://example.com/A-~
http://example.com/A~-
http://example.com/A~-
In the third case (although it probably depends on the processor) you'd get one clash:
http://example.com/A~
http://example.com/A~
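The three cases can be reproduced with a small Python sketch. The helper names `expand` and `clashes` are hypothetical, the rows mirror the table above (with `None` standing in for NULL), and percent-encoding is left out because every value here consists of unreserved characters only.

```python
from collections import Counter

ROWS = [
    {"A": "A", "B": "~"},
    {"A": "A~", "B": None},  # NULL
    {"A": "A~", "B": ""},    # empty string
]


def expand(separator, row, null_aborts):
    """Build http://example.com/{A}<separator>{B} for one row."""
    values = []
    for column in ("A", "B"):
        value = row[column]
        if value is None:
            if null_aborts:
                return None  # the whole template evaluates to NULL
            value = ""       # only this reference contributes nothing
        values.append(value)
    return "http://example.com/" + separator.join(values)


def clashes(separator, null_aborts):
    """Return every generated IRI that is produced by more than one row."""
    iris = [expand(separator, row, null_aborts) for row in ROWS]
    counts = Counter(iri for iri in iris if iri is not None)
    return {iri: n for iri, n in counts.items() if n > 1}


print(clashes("", null_aborts=False))   # first case
print(clashes("-", null_aborts=False))  # second case
print(clashes("", null_aborts=True))    # third case
```

The `null_aborts=True` branch assumes that a processor skips the row entirely; as noted above, the actual behaviour in the third case probably depends on the processor.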
Is there a use case to generate a value for a template when one of its referenced values is null?
a. If yes, what happens if all referenced values are null?
Do we want to maintain safe separators?
How do current (R2)RML processors handle this? (Since this is not covered in the R2RML test cases)
Is there another use case for safe separators that I'm missing?
Processor | Requires safe separator in templates | Returns value for templates with one or more null references |
---|---|---|
CARML | NO | NO |
RMLMapper | NO | NO |
Morph-KGC | NO | NO |
SDM-RDFizer | NO | NO |
Ontop | NO | NO |
R2RML-F | NO | NO |
The RMLMapper supports rml:parentTermMap: https://github.com/RMLio/rmlmapper-java/search?q=parentTermMap
:om_5 a rr:ObjectMap;
rr:parent "friendID".
EQUALS
:om_5 a rr:ObjectMap;
rml:parentTermMap :ptm_0.
:ptm_0 rml:reference "friendID".
This allows us to support joining on, e.g. constants (which is interesting when you want to join on IDs AND, e.g. filter on constant values in the join condition) or templates or function values.
It's a subtle extension, but greatly increases the complexity of the join operator.
Even if joining in RML gets a revamp, this is mostly a suggestion to allow for term maps instead of only references.
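As a rough illustration of what joining on term maps rather than plain references could mean for a processor, the Python sketch below treats each side of a join condition as an expression evaluated on a record. All names (`reference`, `constant`, `join`) and the sample data are hypothetical, and the nested loop is only the simplest possible strategy, not how the RMLMapper implements it.

```python
from typing import Callable, Dict, List, Optional, Tuple

Record = Dict[str, Optional[str]]
Expression = Callable[[Record], Optional[str]]


def reference(name: str) -> Expression:
    """Expression that reads a value from the record, like rml:reference."""
    return lambda record: record.get(name)


def constant(value: str) -> Expression:
    """Expression that ignores the record, like a constant-valued term map."""
    return lambda record: value


def join(children: List[Record], parents: List[Record],
         conditions: List[Tuple[Expression, Expression]]) -> List[Tuple[Record, Record]]:
    """Nested-loop join: keep (child, parent) pairs satisfying every condition."""
    result = []
    for child in children:
        for parent in parents:
            if all(
                c(child) is not None and c(child) == p(parent)
                for c, p in conditions
            ):
                result.append((child, parent))
    return result


people = [{"name": "Ann", "friendID": "2"}, {"name": "Bob", "friendID": "3"}]
friends = [{"id": "2", "status": "active"}, {"id": "3", "status": "inactive"}]

# Join on IDs and, in the same join condition, filter on a constant value.
pairs = join(people, friends, [
    (reference("friendID"), reference("id")),
    (constant("active"), reference("status")),
])
print(pairs)
```

Once either side can be an arbitrary expression, the operator has to evaluate it per record instead of simply comparing two columns, which is where the added complexity mentioned above comes from.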
@andimou's comment (putting it here so the discussion doesn't get lost):
I think when we defined this back in the days we had in mind a superclass of Term Map and Function Map as it might be a bit incorrect to say that the Function Map is a type of Term Map, because a Term Map expects at least one of the reference/template/constant or none but a term type to specify it's a blank node, whereas the Function Map as you also mention, expects a triples map.
In principle it's something like, an RDF term is generated by
- a Term Map, which may be
- reference-based
- constant-based
- template-based
- none of the above but with term type blank node
- Function Map
ie a function map is an alternative to what nowadays is a term map
(or do I not remember correctly?)
The R2RML spec says
The subjects often are IRIs that are generated from the primary key column(s) of the table.
Does this "often" refer to the IRIs or to the primary key column?
I would think that this would translate to
The subjects often are IRIs that are generated from the primary key column(s) of the table.
or could they be anything else as well?
A recent discussion in kg-construct/rml-questions#28 suggested that it might be useful to support the generation of URIs next to IRIs, in order to facilitate Linked Data dereferencing. As HTTP only supports URIs, implementing IRI dereferencing inevitably requires an IRI-to-URI mapping below the surface (for example, how DBpedia did it). It might therefore be valuable to also be able to generate URIs.
Note by @DylanVanAssche on this:
Maybe we need to have rr:IRI and rr:URI in the new spec as rr:termType?
This way, mapping rules can be explicit about this.
Is this indeed a useful enough feature to add to RML?
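For reference, the IRI-to-URI mapping mentioned above can be sketched in a few lines of Python. This is a simplified, hypothetical helper (`iri_to_uri` is not part of any RML processor): it percent-encodes the UTF-8 bytes of every non-ASCII character and leaves ASCII URI characters untouched, but it does not convert internationalized host names.

```python
from urllib.parse import quote

# ASCII characters that are reserved in URIs, plus "%" so that existing
# percent-encoded sequences are left as-is.
_KEEP = "!#$%&'()*+,/:;=?@[]~"


def iri_to_uri(iri: str) -> str:
    """Percent-encode non-ASCII characters of an IRI as UTF-8 bytes.

    Assumption: host names are ASCII already; IDN hosts would need a
    separate conversion that is not attempted here.
    """
    return quote(iri, safe=_KEEP)


print(iri_to_uri("http://example.com/r\u00e9sum\u00e9"))
```

A processor that supported a URI term type could apply a step like this after generating the IRI, so that the mapping rules stay the same and only the term type changes.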
The R2RML spec includes the following paragraph for duplicates handling:
Duplicate row preservation: For tables without a primary key, the Direct Graph requires that a fresh blank node is created for each row. This ensures that duplicate rows in such tables are preserved. This requirement is relaxed for R2RML default mappings: They MAY reuse the same blank node for multiple duplicate rows. This behaviour does not preserve duplicate rows. R2RML default mapping generators that provide default mappings based on the Direct Graph MUST document whether the generated default mapping preserves duplicate rows or not.
This needs to be adjusted for the case of heterogeneous data.
Currently in the proposed spec, rml:graphMap is only usable under PredicateObjectMap and SubjectMap.
The current spec does not allow fine tuning where you want to put the named graphs at the predicate/object level without considering things at the PredicateObjectMap level.
In my opinion, enabling usage of rml:graphMap at all term map levels (SubjectMap, PredicateMap, and ObjectMap) will give the user more control to decide, at the term level, which graph the output will belong to.
It will also reduce redundancy when writing RML documents by keeping closely related POMs in one place instead of being spread out all over the document.
Suppose you want all triples with predicate <predicateAB> to be in the named graphs <graphA> and <graphB>.
But you want only <predicateC> in named graph <graphC>, with the same object <object>.
<subject> <predicateAB> <object> <graphA>.
<subject> <predicateAB> <object> <graphB>.
<subject> <predicateC> <object> <graphC>.
Current RML mapping to achieve this:
rml:subjectMap [
rml:template "subject"
];
rml:predicateObjectMap [
rml:predicateMap [
rml:constant ex:predicateAB
];
rml:graph ex:graphA;
rml:graph ex:graphB;
rml:object "object"
];
rml:predicateObjectMap [
rml:predicateMap [
rml:constant ex:predicateC;
];
rml:graph ex:graphC;
rml:object "object"
].
If GraphMaps could be defined at PredicateMap level, the following would be possible:
rml:subjectMap [
rml:template "subject"
];
rml:predicateObjectMap [
rml:predicateMap [
rml:constant ex:predicateAB;
rml:graph ex:graphA;
rml:graph ex:graphB;
];
rml:predicateMap [
rml:constant ex:predicateC;
rml:graph ex:graphC;
];
rml:object "object"
].
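One possible reading of the proposed semantics, sketched in Python: each predicate map carries its own graphs, and a quad is emitted for every (predicate, graph) combination within that predicate map. The function `generate_quads` and the dictionary layout are hypothetical and only illustrate this reading; the proposal itself does not prescribe an implementation.

```python
def generate_quads(subject, predicate_maps, obj):
    """Emit one quad per (predicate, graph) pair of each predicate map."""
    quads = []
    for predicate_map in predicate_maps:
        for graph in predicate_map["graphs"]:
            quads.append((subject, predicate_map["predicate"], obj, graph))
    return quads


# Mirrors the single predicate-object map with two predicate maps above.
predicate_maps = [
    {"predicate": "ex:predicateAB", "graphs": ["ex:graphA", "ex:graphB"]},
    {"predicate": "ex:predicateC", "graphs": ["ex:graphC"]},
]

for quad in generate_quads("subject", predicate_maps, "object"):
    print(" ".join(quad) + " .")
```

Under this reading the graphs of one predicate map do not leak into triples generated by a sibling predicate map, which is what yields the three desired quads from a single predicate-object map.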
This also better aligns with FnO where FunctionMaps could be applied at every TermMap level (SubjectMap, PredicateMap, and ObjectMap)
Right now, term maps with rml:template and term type rml:IRI always perform percent-encoding, and the spec notes that non-encoded values can only be achieved using rml:reference, so preprocessing is needed. Alternatively, an FnO string concatenation function could be used.
However, this turns out to be a very common case, e.g. creating a mailto: resource from an email address: you just want to be able to do rml:template "mailto:{mailadres}", but that doesn't work because the @ will be percent-encoded (see RMLio/rmlmapper-java#219 (comment) for a recent similar request).
Does it make sense to introduce an rml:templateUnsafe predicate or similar, to have an easy-to-use construct for handling this common request? It would do the same as rml:template, but never apply a reference value transforming function.
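The difference can be illustrated with a short Python sketch. The `expand` helper and its `encode` flag are hypothetical stand-ins for rml:template and the proposed unsafe variant, and the encoding step is a simplification that percent-encodes everything outside the ASCII unreserved characters.

```python
import re
from urllib.parse import quote


def expand(template, row, encode=True):
    """Fill {reference} placeholders, optionally percent-encoding each value."""
    def substitute(match):
        value = str(row[match.group(1)])
        # encode=True mimics the current rml:template behaviour for IRIs;
        # encode=False mimics the proposed "unsafe" variant.
        return quote(value, safe="~") if encode else value
    return re.sub(r"\{([^{}]+)\}", substitute, template)


row = {"mailadres": "jane@example.com"}
print(expand("mailto:{mailadres}", row))                # mailto:jane%40example.com
print(expand("mailto:{mailadres}", row, encode=False))  # mailto:jane@example.com
```

Skipping the encoding shifts the responsibility for producing a valid IRI to the mapping author, which is presumably why the name contains "Unsafe".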
Currently in the spec, the Expression Map is introduced as a subsection of the Term Maps section. This is strange, since it is something that is broader than a term map and is also used as a super class by non-term map constructs.
Furthermore, the description of Term Maps and Expression Map is combined in sections like "Reference-valued Expression Maps and term types".
The spec also currently states things like:
A reference-valued Expression Map generates an RDF Term which is by default a Literal.
This deviates significantly from how it was first described and is not what was originally intended. The idea was to introduce expression maps not as generating RDF terms, but simply as generating values from an input source using an expression.
This allows them to be used for more than generating terms, for example in join conditions.
I think the current mixing of Term Maps and Expression Maps makes the spec difficult to follow. It also makes it harder for other documents to reference sections about Expression Map aspects without Term Map aspects.
I would prefer to see the Expression Map first introduced in a separate section as abstract functions on the input source, followed by a section about term maps which then concretely describes how they are used to generate RDF terms.
For instance, in Section 2.2 Mapping Graphs and the RML Vocabulary, there are a bunch of links to the spec and not a specific source. Specific nodes cannot be found in the spec either (e.g., Logical Source).
How joins are currently handled:
We refer to a Parent Triples Map but in fact we use the Subject Map of the Triples Map.
<#TM1>
rml:logicalSource <LS1> ;
rml:subjectMap <#SM1> ;
rml:predicateObjectMap [
rml:predicateMap <#PM1> ;
rml:objectMap [ rml:parentTriplesMap <#TM2> ] ].
<#TM2>
rml:logicalSource <LS2> ;
rml:subjectMap <#SM2> ;
rml:predicateObjectMap [
rml:predicateMap <#PM2> ;
rml:objectMap <#OM2> ] .
Limitations:
This limits us because a Subject Map can only be an IRI or a Blank Node but it cannot be a literal.
It also limits us as we can only join the object of the one triple with the subject of another triple.
Solution:
Actually refer to the RDF term which we want to reuse instead of the Triples Map.
In this case, the above would become:
<#TM1>
rml:logicalSource <LS1> ;
rml:subjectMap <#SM1> ;
rml:predicateObjectMap [
rml:predicateMap <#PM1> ;
rml:objectMap [ rml:parentTermMap <#SM2> ] ].
<#TM2>
rml:logicalSource <LS2> ;
rml:subjectMap <#SM2> ;
rml:predicateObjectMap [
rml:predicateMap <#PM2> ;
rml:objectMap <#OM2> ] .
(rml:parentTermMap would be introduced to refer to the Term we want to reuse)
We can reuse the above to even refer to the object of <#TM2>, e.g.:
<#TM1>
rml:logicalSource <LS1> ;
rml:subjectMap <#SM1> ;
rml:predicateObjectMap [
rml:predicateMap <#PM1> ;
rml:objectMap [ rml:parentTermMap <#OM2> ] ].
<#TM2>
rml:logicalSource <LS2> ;
rml:subjectMap <#SM2> ;
rml:predicateObjectMap [
rml:predicateMap <#PM2> ;
rml:objectMap <#OM2> ] .
and we can do that in any Term Map, e.g., in the following we join the Subject Map of the one TM with the Subject Map of the other TM:
<#TM1>
rml:logicalSource <LS1> ;
rml:subjectMap [ rml:parentTermMap <#SM2> ] .
<#TM2>
rml:logicalSource <LS2> ;
rml:subjectMap <#SM2> .
That proposal would solve the issues with joins for literals but not the cases where we want to use data from two data sources as in kg-construct/rml-jc#3.