morph-kgc's Introduction

Morph-KGC is an engine that constructs RDF knowledge graphs from heterogeneous data sources with the R2RML and RML mapping languages. Morph-KGC is built on top of pandas and it leverages mapping partitions to significantly reduce execution times and memory consumption for large data sources.

Features ✨

Documentation 📑

Read the documentation.

Tutorial 👩‍🏫

Learn quickly with the tutorial in Google Colaboratory!

Getting Started 🚀

PyPI is the fastest way to install Morph-KGC:

pip install morph-kgc

We recommend installing Morph-KGC in a virtual environment.

To run the engine from the command line, execute the following:

python3 -m morph_kgc config.ini

Check the documentation to see how to generate the configuration INI file. Here you can also see an example INI file.

It is also possible to run Morph-KGC as a library with RDFLib and Oxigraph:

import morph_kgc

# generate the triples and load them to an RDFLib graph
g_rdflib = morph_kgc.materialize('/path/to/config.ini')
# work with the RDFLib graph
q_res = g_rdflib.query('SELECT DISTINCT ?classes WHERE { ?s a ?classes }')

# generate the triples and load them to Oxigraph
g_oxigraph = morph_kgc.materialize_oxigraph('/path/to/config.ini')
# work with Oxigraph
q_res = g_oxigraph.query('SELECT DISTINCT ?classes WHERE { ?s a ?classes }')

# the methods above also accept the config as a string
config = """
            [DataSource1]
            mappings: /path/to/mapping/mapping_file.rml.ttl
            db_url: mysql+pymysql://user:password@localhost:3306/db_name
         """
g_rdflib = morph_kgc.materialize(config)

License 🔓

Morph-KGC is available under the Apache License 2.0.

Author & Contact 📬

Ontology Engineering Group, Universidad Politécnica de Madrid.

Citing 💬

If you used Morph-KGC in your work, please cite the SWJ paper:

@article{arenas2024morph,
  title     = {{Morph-KGC: Scalable knowledge graph materialization with mapping partitions}},
  author    = {Arenas-Guerrero, Julián and Chaves-Fraga, David and Toledo, Jhon and Pérez, María S. and Corcho, Oscar},
  journal   = {Semantic Web},
  publisher = {IOS Press},
  issn      = {2210-4968},
  year      = {2024},
  doi       = {10.3233/SW-223135},
  volume    = {15},
  number    = {1},
  pages     = {1-20}
}

Sponsor 🛡️

BASF

morph-kgc's People

Contributors

ahmad88me, arenas-guerrero-julian, dachafra, dylanvanassche, eltociear, ershimen, jatoledo, kappagi, luciacabanillasrodriguez, mielvds, ocorcho, pawelostr, therazorace

morph-kgc's Issues

Generate correct URIs

Encode the URIs for subject and templates correctly (e.g., example.org/NEW YORK --> example.org/NEW%20YORK)

Have a parameter in the config like:

uri.encode=(" "->"%20"),,(","->""),,("á"->"a"),,("é"->"e"),,("í"->"i"),,("ó"->"o"),,("ú"->"u"),,("ü"->"u"),,("ñ"->"n"),,("\u00B4"->"%C2%B4")

Morph-RDB has it.
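A minimal sketch of what this encoding could look like, assuming plain percent-encoding is acceptable. The helper names `encode_template_value` and `fill_template` are hypothetical and not part of Morph-KGC. Note that `urllib.parse.quote` also percent-encodes non-ASCII characters as UTF-8 bytes, which may be stricter than what the R2RML specification requires for IRI-safe values.

```python
from urllib.parse import quote


def encode_template_value(value) -> str:
    """Percent-encode a value before substituting it into an IRI template."""
    # safe="" means reserved characters such as "/" are encoded as well.
    return quote(str(value), safe="")


def fill_template(template: str, row: dict) -> str:
    """Replace each {column} placeholder with the encoded value from the row."""
    result = template
    for column, value in row.items():
        result = result.replace("{" + column + "}", encode_template_value(value))
    return result


iri = fill_template("http://example.org/{city}", {"city": "NEW YORK"})
print(iri)  # http://example.org/NEW%20YORK
```

Only the substituted values are encoded, so the constant part of the template is left untouched.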

ENH: XPath 3.0

Describe the Solution you'd Like

Switch to XPath 3.0 for querying XML

Additional Context

elementpath now supports XPath 3.0

Accept YARRRML

Accept YARRRML for mappings. It could be used directly to materialize or be converted to R2RML or RML. Ideally, pretty [R2]RML would be generated.

Error when parsing mapping with rr:object

The mapping parser fails when the mapping contains an rr:object. To solve this, convert the constant shortcut properties (rr:subject, rr:predicate, rr:object and rr:graph) to their non-shortcut form BEFORE executing the parsing query. Do it similarly to the R2RML to RML conversion, by replacing triples in the rdflib graph. E.g., the first POM would result in the second one:

rr:predicateObjectMap [ a rr:PredicateObjectMap;
      rr:object "true";
      rr:predicate :ex
    ];
  rr:predicateObjectMap [ a rr:PredicateObjectMap;
      rr:objectMap [ rr:constant "true" ];
      rr:predicate :ex
    ];

Do it for all constant shortcut properties, this way we also reduce the complexity of the parsing query :)
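The rewriting step can be sketched on plain `(subject, predicate, object)` tuples. This is an illustration only: the real change would operate on the rdflib graph, and the function name and blank node labels below are hypothetical.

```python
import itertools

RR = "http://www.w3.org/ns/r2rml#"

# Each constant shortcut property and the term map property that replaces it.
SHORTCUTS = {
    RR + "subject": RR + "subjectMap",
    RR + "predicate": RR + "predicateMap",
    RR + "object": RR + "objectMap",
    RR + "graph": RR + "graphMap",
}


def expand_constant_shortcuts(triples):
    """Return a new set of triples where every shortcut is a term map with rr:constant."""
    counter = itertools.count()
    expanded = set()
    for s, p, o in triples:
        if p in SHORTCUTS:
            node = f"_:tm{next(counter)}"  # fresh blank node label (hypothetical naming)
            expanded.add((s, SHORTCUTS[p], node))
            expanded.add((node, RR + "constant", o))
        else:
            expanded.add((s, p, o))
    return expanded


pom = {
    ("_:pom", RR + "object", '"true"'),
    ("_:pom", RR + "predicate", "http://example.org/ex"),
}
expanded_pom = expand_constant_shortcuts(pom)
```

After this step the parsing query only needs to handle the term map form.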

Oracle identifier casing causes problems

Oracle represents case-insensitive identifiers in uppercase (https://docs.sqlalchemy.org/en/14/dialects/oracle.html#identifier-casing). As a result, the identifier must be uppercase in the SQL query (and therefore also in the mappings), but sqlalchemy returns a dataframe with the identifiers in lowercase. When that dataframe is accessed later, the original identifier does not match the identifier in the dataframe.
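One possible workaround, sketched under the assumption that a case-insensitive lookup is acceptable when it is unambiguous. The helper `resolve_column` is hypothetical and not part of Morph-KGC or sqlalchemy.

```python
def resolve_column(reference, columns):
    """Find the dataframe column that matches a mapping reference, ignoring case."""
    if reference in columns:
        return reference  # an exact match always wins
    matches = [c for c in columns if c.lower() == reference.lower()]
    if len(matches) == 1:
        return matches[0]
    # No match, or several columns that differ only by case: refuse to guess.
    raise KeyError(reference)


column = resolve_column("STUDENT_ID", ["student_id", "name"])
print(column)  # student_id
```

With pandas, the resolved name could then be used to index the dataframe instead of the raw reference from the mapping.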

Key error when column references contain quotes

Test-case R2RMLTC0001a
Mapping and expected result: https://github.com/kg-construct/r2rml-implementation-report/tree/main/test-cases/R2RMLTC0001a
Database: https://github.com/kg-construct/r2rml-implementation-report/tree/main/test-cases/databases/d001.sql

Log:

Traceback (most recent call last):
  File "/Users/dchaves/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2895, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: '"Name"'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "morph-kgc/semantify.py", line 25, in <module>
    materialize(mappings_df, config)
  File "/Users/dchaves/Downloads/morph-kgc/r2rml-implementation-report/test-cases/morph-kgc/materializer.py", line 274, in materialize
    result_triples = _materialize_mapping_rule(mapping_rule, subject_maps_df, config)
  File "/Users/dchaves/Downloads/morph-kgc/r2rml-implementation-report/test-cases/morph-kgc/materializer.py", line 230, in _materialize_mapping_rule
    query_results_df = _materialize_template(query_results_df, mapping_rule['subject_template'], termtype=mapping_rule['subject_termtype'])
  File "/Users/dchaves/Downloads/morph-kgc/r2rml-implementation-report/test-cases/morph-kgc/materializer.py", line 61, in _materialize_template
    query_results_df['reference_results'] = query_results_df[columns_alias + reference]
  File "/Users/dchaves/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 2906, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/Users/dchaves/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2897, in get_loc
    raise KeyError(key) from err
KeyError: '"Name"'
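The traceback suggests that the quoted reference `"Name"` is used verbatim as a dataframe key. A minimal sketch of one way to normalize such references before the lookup follows; `normalize_reference` is a hypothetical helper, and the set of delimiters it handles is an assumption.

```python
def normalize_reference(reference: str) -> str:
    """Strip one pair of enclosing SQL identifier delimiters, if present."""
    for opening, closing in (('"', '"'), ("`", "`"), ("[", "]")):
        if (
            len(reference) >= 2
            and reference.startswith(opening)
            and reference.endswith(closing)
        ):
            return reference[1:-1]
    return reference


print(normalize_reference('"Name"'))  # Name
```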

Error in mapping rules that do not generate triples

For hierarchical data formats (JSON and XML), an error is raised if a reference in the mapping is not present in the data file. The intermediate dataframe should contain all references from the mapping rule (even those that are not present in the data file) to avoid errors.
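The idea can be sketched on plain dictionaries instead of a dataframe. `complete_records` is a hypothetical helper; with pandas, reindexing the columns would play a similar role.

```python
def complete_records(records, references):
    """Give every record a key for every reference, defaulting to None.

    Keys that are not in `references` are dropped, so the result contains
    exactly the columns that the mapping rule needs.
    """
    return [{ref: record.get(ref) for ref in references} for record in records]


records = [{"id": 1, "name": "Alice"}, {"id": 2}]
completed = complete_records(records, ["id", "name", "email"])
print(completed[1])  # {'id': 2, 'name': None, 'email': None}
```

Rows whose references are missing then produce no triples instead of raising an error.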

Pandas warning

What Happens?

Python warning, possible future problem:

morph-kgc-api_1  | INFO | 2022-02-17 23:59:08,719 | Mapping partition with 8 groups generated.
morph-kgc-api_1  | INFO | 2022-02-17 23:59:08,719 | Maximum number of rules within mapping group: 11.
morph-kgc-api_1  | INFO | 2022-02-17 23:59:08,722 | Mappings processed in 0.357 seconds.
morph-kgc-api_1  | /usr/local/lib/python3.8/dist-packages/morph_kgc/data_source/data_file.py:148: FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
morph-kgc-api_1  |   json_df = json_df[references]
morph-kgc-api_1  | /usr/local/lib/python3.8/dist-packages/morph_kgc/data_source/data_file.py:148: FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
morph-kgc-api_1  |   json_df = json_df[references]
morph-kgc-api_1  | /usr/local/lib/python3.8/dist-packages/morph_kgc/data_source/data_file.py:148: FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
morph-kgc-api_1  |   json_df = json_df[references]
morph-kgc-api_1  | /usr/local/lib/python3.8/dist-packages/morph_kgc/data_source/data_file.py:148: FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
morph-kgc-api_1  |   json_df = json_df[references]
morph-kgc-api_1  | /usr/local/lib/python3.8/dist-packages/morph_kgc/data_source/data_file.py:148: FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
morph-kgc-api_1  |   json_df = json_df[references]
morph-kgc-api_1  | /usr/local/lib/python3.8/dist-packages/morph_kgc/data_source/data_file.py:148: FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
morph-kgc-api_1  |   json_df = json_df[references]
morph-kgc-api_1  | /usr/local/lib/python3.8/dist-packages/morph_kgc/data_source/data_file.py:148: FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
morph-kgc-api_1  |   json_df = json_df[references]
morph-kgc-api_1  | /usr/local/lib/python3.8/dist-packages/morph_kgc/data_source/data_file.py:148: FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
morph-kgc-api_1  |   json_df = json_df[references]
morph-kgc-api_1  | /usr/local/lib/python3.8/dist-packages/morph_kgc/data_source/data_file.py:148: FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
morph-kgc-api_1  |   json_df = json_df[references]
morph-kgc-api_1  | /usr/local/lib/python3.8/dist-packages/morph_kgc/data_source/data_file.py:148: FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
morph-kgc-api_1  |   json_df = json_df[references]
morph-kgc-api_1  | /usr/local/lib/python3.8/dist-packages/morph_kgc/data_source/data_file.py:148: FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
morph-kgc-api_1  |   json_df = json_df[references]
morph-kgc-api_1  | /usr/local/lib/python3.8/dist-packages/morph_kgc/data_source/data_file.py:148: FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
morph-kgc-api_1  |   json_df = json_df[references]
morph-kgc-api_1  | /usr/local/lib/python3.8/dist-packages/morph_kgc/data_source/data_file.py:148: FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
morph-kgc-api_1  |   json_df = json_df[references]
morph-kgc-api_1  | /usr/local/lib/python3.8/dist-packages/morph_kgc/data_source/data_file.py:149: SettingWithCopyWarning: 
morph-kgc-api_1  | A value is trying to be set on a copy of a slice from a DataFrame
morph-kgc-api_1  | 
morph-kgc-api_1  | See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
morph-kgc-api_1  |   json_df.dropna(axis=0, how='any', inplace=True)

To Reproduce

Run the JSON example from the repo.

Environment (please complete the following information):

  • OS: Ubuntu 20.04 inside a Docker container
  • Python version: 3.8
  • Morph-KGC version: 1.5.0

Add testing

Add tests with pytest.

This should consider at least the RML test cases and the R2RML test cases (with SQLite).

Allow ontology as input parameter

Validating the mappings against the ontology is especially relevant when sources are complex and there are many mapping rules.

Predicate constants that start with the same string go to the same partition

The following constants (in predicates) of the GTFS benchmark are put in the same partition, since the shortest constant matches the beginning of the other constants. Try to avoid this in order to get as many mapping partitions as possible.

http://vocab.gtfs.org/terms#route
http://vocab.gtfs.org/terms#routeType
http://vocab.gtfs.org/terms#routeUrl
http://vocab.gtfs.org/terms#stop
http://vocab.gtfs.org/terms#stopSequence
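The difference between a prefix comparison and an exact comparison can be illustrated with a small sketch. The grouping below is a simplified greedy procedure written for this example, not the partitioning algorithm used by the engine, and all function names are hypothetical.

```python
def overlaps_by_prefix(a: str, b: str) -> bool:
    """Prefix comparison: two values overlap if one starts with the other."""
    return a.startswith(b) or b.startswith(a)


def constants_overlap(a: str, b: str) -> bool:
    """Two constants can only yield the same term if they are identical."""
    return a == b


def group(values, overlap):
    """Greedily place each value in the first group it overlaps with."""
    groups = []
    for value in values:
        for members in groups:
            if any(overlap(value, member) for member in members):
                members.append(value)
                break
        else:
            groups.append([value])
    return groups


predicates = [
    "http://vocab.gtfs.org/terms#route",
    "http://vocab.gtfs.org/terms#routeType",
    "http://vocab.gtfs.org/terms#routeUrl",
    "http://vocab.gtfs.org/terms#stop",
    "http://vocab.gtfs.org/terms#stopSequence",
]
print(len(group(predicates, overlaps_by_prefix)))  # 2
print(len(group(predicates, constants_overlap)))   # 5
```

Comparing constants for equality keeps the five predicates in five separate groups.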

Mapping partitioning on objects

Add objects as a mapping partitioning criterion. Namely, if a datatype or language tag is available in all parsed mapping rules that have literals as objects, and templates or constants are used for the objects of the remaining mapping rules, it is possible to use objects as a mapping partitioning criterion. Example:

"ordenador"@sp
"computer"@en
"4"^^xsd:integer
"13/02/2021"^^xsd:date
<http://example.org/resource/stop/4>

The mapping rules that generate these objects can be used for mapping partitioning. If a mapping rule that generates a literal does not provide a language tag or datatype, it does not limit the partitioning criteria, as those literals could be placed in a different group. Example:

"computer"
"ordenador"
"4"

Mapping rules that generate these objects can be placed in the same group, separate from the groups used for the objects considered before.
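A rough sketch of such a grouping key, assuming each mapping rule is described by a dictionary with `termtype`, `language` and `datatype` entries. This representation and the function names are hypothetical, and IRIs are placed in a single group here for brevity.

```python
from collections import defaultdict

XSD = "http://www.w3.org/2001/XMLSchema#"


def object_partition_key(rule):
    """Return the group key for the objects that a mapping rule generates."""
    if rule["termtype"] == "iri":
        return ("iri",)
    if rule.get("language"):
        return ("language", rule["language"])
    if rule.get("datatype"):
        return ("datatype", rule["datatype"])
    return ("plain",)  # literals without language tag or datatype


def group_by_object(rules):
    """Group rule names by the kind of object they generate."""
    groups = defaultdict(list)
    for rule in rules:
        groups[object_partition_key(rule)].append(rule["name"])
    return dict(groups)


rules = [
    {"name": "r1", "termtype": "literal", "language": "sp"},
    {"name": "r2", "termtype": "literal", "language": "en"},
    {"name": "r3", "termtype": "literal", "datatype": XSD + "integer"},
    {"name": "r4", "termtype": "literal", "datatype": XSD + "date"},
    {"name": "r5", "termtype": "iri"},
    {"name": "r6", "termtype": "literal"},
    {"name": "r7", "termtype": "literal"},
]
groups = group_by_object(rules)
```

Rules in different groups cannot generate the same object, so they could be processed independently.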

Management of empty sets

Test case R2RMLTC0000 tests an empty database with a valid schema. morph-kgc fails because it tries to find answers.

Mapping and result: https://github.com/kg-construct/r2rml-implementation-report/tree/main/test-cases/R2RMLTC0000
Database: https://github.com/kg-construct/r2rml-implementation-report/blob/main/test-cases/databases/d000.sql

Log:

Traceback (most recent call last):
  File "/Users/dchaves/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2895, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: '"Name"'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "morph-kgc/semantify.py", line 25, in <module>
    materialize(mappings_df, config)
  File "/Users/dchaves/Downloads/morph-kgc/r2rml-implementation-report/test-cases/morph-kgc/materializer.py", line 274, in materialize
    result_triples = _materialize_mapping_rule(mapping_rule, subject_maps_df, config)
  File "/Users/dchaves/Downloads/morph-kgc/r2rml-implementation-report/test-cases/morph-kgc/materializer.py", line 230, in _materialize_mapping_rule
    query_results_df = _materialize_template(query_results_df, mapping_rule['subject_template'], termtype=mapping_rule['subject_termtype'])
  File "/Users/dchaves/Downloads/morph-kgc/r2rml-implementation-report/test-cases/morph-kgc/materializer.py", line 61, in _materialize_template
    query_results_df['reference_results'] = query_results_df[columns_alias + reference]
  File "/Users/dchaves/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 2906, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/Users/dchaves/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2897, in get_loc
    raise KeyError(key) from err
KeyError: '"Name"'

Support CSV

Allow all tabular formats: parquet, excel, tsv.

Create a constants.py to store engine constants

Example in Morph-RDB: https://github.com/oeg-upm/morph-rdb/blob/a76de03787b2a0b18288d1bd9dec9c374554599b/morph-base/src/main/scala/es/upm/fi/dia/oeg/morph/base/Constants.scala

Include, for instance, variables for:

http://www.w3.org/ns/r2rml#Literal
http://www.w3.org/ns/r2rml#IRI

Include what is already distributed across the different modules:

import multiprocessing as mp  # needed for mp.cpu_count() below

SQL_RDF_DATATYPE = {
    'INTEGER': 'http://www.w3.org/2001/XMLSchema#integer',
    'INT': 'http://www.w3.org/2001/XMLSchema#integer',
    'SMALLINT': 'http://www.w3.org/2001/XMLSchema#integer',
    'DECIMAL': 'http://www.w3.org/2001/XMLSchema#decimal',
    'NUMERIC': 'http://www.w3.org/2001/XMLSchema#decimal',
    'FLOAT': 'http://www.w3.org/2001/XMLSchema#double',
    'REAL': 'http://www.w3.org/2001/XMLSchema#double',
    'DOUBLE': 'http://www.w3.org/2001/XMLSchema#double',
    'BOOL': 'http://www.w3.org/2001/XMLSchema#boolean',
    'TINYINT': 'http://www.w3.org/2001/XMLSchema#boolean',
    'BOOLEAN': 'http://www.w3.org/2001/XMLSchema#boolean',
    'DATE': 'http://www.w3.org/2001/XMLSchema#date',
    'TIME': 'http://www.w3.org/2001/XMLSchema#time',
    'DATETIME': 'http://www.w3.org/2001/XMLSchema#dateTime',
    'TIMESTAMP': 'http://www.w3.org/2001/XMLSchema#dateTime',
    'BINARY': 'http://www.w3.org/2001/XMLSchema#hexBinary',
    'VARBINARY': 'http://www.w3.org/2001/XMLSchema#hexBinary',
    'BIT': 'http://www.w3.org/2001/XMLSchema#hexBinary',
    'YEAR': 'http://www.w3.org/2001/XMLSchema#integer'
}
ARGUMENTS_DEFAULT = {
    'output_dir': 'output',
    'output_file': 'result',
    'output_format': 'nquads',
    'clean_output_dir': 'yes',
    'mapping_partitions': 'guess',
    'input_parsed_mappings_path': '',
    'output_parsed_mappings_path': '',
    'logs_file': '',
    'logging_level': 'info',
    'push_down_sql_distincts': 'no',
    'number_of_processes': mp.cpu_count(),
    'process_start_method': 'default',
    'async': 'no',
    'chunksize': 100000,
    'infer_datatypes': 'yes',
    'coerce_float': 'no',
    'only_printable_characters': 'no'
}
VALID_ARGUMENTS = {
    'output_format': ['ntriples', 'nquads'],
    'mapping_partitions': ['', 's', 'p', 'g', 'sp', 'sg', 'pg', 'spg', 'guess'],
    'relational_source_type': ['mysql', 'postgresql', 'oracle', 'sqlserver'],
    'file_source_type': [],
    'process_start_method': ['default', 'spawn', 'fork', 'forkserver'],
    'logging_level': ['notset', 'debug', 'info', 'warning', 'error', 'critical']
}

Multiprocessing deadlock for large tasks

When using multiprocessing with large tasks (GTFS-Madrid-Bench size 500 for CSV format) the engine does not finish execution due to deadlock. Error trace:

^CProcess ForkPoolWorker-5:
Process ForkPoolWorker-4:
Process ForkPoolWorker-2:
Traceback (most recent call last):
  File "morph-kgc.py", line 61, in <module>
    process_materialization(mappings, config)
  File "morph-kgc.py", line 50, in process_materialization
    materializer.materialize_concurrently()
  File "/home/julian/PycharmProjects/Morph-KGC/morph-kgc/materializer.py", line 347, in materialize_concurrently
    num_triples = sum(pool.starmap(_materialize_mapping_partition,
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 372, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 765, in get
    self.wait(timeout)
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 762, in wait
    self._event.wait(timeout)
  File "/usr/lib/python3.8/threading.py", line 558, in wait
    signaled = self._cond.wait(timeout)
  File "/usr/lib/python3.8/threading.py", line 302, in wait
    waiter.acquire()
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 114, in worker
    task = get()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.8/multiprocessing/queues.py", line 355, in get
    with self._rlock:
  File "/usr/lib/python3.8/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 114, in worker
    task = get()
  File "/usr/lib/python3.8/multiprocessing/queues.py", line 355, in get
    with self._rlock:
  File "/usr/lib/python3.8/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 114, in worker
    task = get()
  File "/usr/lib/python3.8/multiprocessing/queues.py", line 355, in get
    with self._rlock:
  File "/usr/lib/python3.8/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
KeyboardInterrupt
KeyboardInterrupt
KeyboardInterrupt
^C 

DBMS support

Currently, Morph supports MySQL for relational data. It would be nice to add support for:

  • PostgreSQL
  • Oracle
  • SQLserver
  • SQLite
  • DB2
  • MariaDB

Update this according to how each added DBMS deals with types, and update the translation of datatypes from SQL to RDF accordingly:

SQL_RDF_DATATYPE = {
    'INTEGER': 'http://www.w3.org/2001/XMLSchema#integer',
    'INT': 'http://www.w3.org/2001/XMLSchema#integer',
    'SMALLINT': 'http://www.w3.org/2001/XMLSchema#integer',
    'DECIMAL': 'http://www.w3.org/2001/XMLSchema#decimal',
    'NUMERIC': 'http://www.w3.org/2001/XMLSchema#decimal',
    'FLOAT': 'http://www.w3.org/2001/XMLSchema#double',
    'REAL': 'http://www.w3.org/2001/XMLSchema#double',
    'DOUBLE': 'http://www.w3.org/2001/XMLSchema#double',
    'BOOL': 'http://www.w3.org/2001/XMLSchema#boolean',
    'TINYINT': 'http://www.w3.org/2001/XMLSchema#boolean',
    'BOOLEAN': 'http://www.w3.org/2001/XMLSchema#boolean',
    'DATE': 'http://www.w3.org/2001/XMLSchema#date',
    'TIME': 'http://www.w3.org/2001/XMLSchema#time',
    'DATETIME': 'http://www.w3.org/2001/XMLSchema#dateTime',
    'TIMESTAMP': 'http://www.w3.org/2001/XMLSchema#dateTime',
    'BINARY': 'http://www.w3.org/2001/XMLSchema#hexBinary',
    'VARBINARY': 'http://www.w3.org/2001/XMLSchema#hexBinary',
    'BIT': 'http://www.w3.org/2001/XMLSchema#hexBinary',
    'YEAR': 'http://www.w3.org/2001/XMLSchema#integer'
}

Improve mapping loading

  • Load [R2]RML from any RDF serialization. Support as many serializations as possible.
  • Support dir path to mapping of a data source (currently supporting only file or list of files).

Quotes in queries

' to enclose string literals
` to enclose identifiers (table and column names)

Is temporary directory needed for results of mapping partitions?

Currently, the results of each mapping partition are written to independent files in a temporary directory. Once all triples are generated, the files are merged. This may make sense to avoid problems when writing to a file with parallelization.

When parallelization is enabled, check whether this temporary directory is needed or whether we can write directly to the final file.
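For reference, the current approach can be sketched with the standard library as follows. File names and helper functions are hypothetical, and the sketch runs sequentially; in the engine each partition would be written by a separate worker process.

```python
import shutil
import tempfile
from pathlib import Path


def write_partition(tmp_dir, index, triples):
    """Write the triples of one mapping partition to its own file."""
    path = Path(tmp_dir) / f"partition_{index}.nt"
    path.write_text("".join(t + "\n" for t in triples), encoding="utf-8")
    return path


def merge_partitions(paths, output_path):
    """Concatenate the partition files into the final output file."""
    with open(output_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as part:
                shutil.copyfileobj(part, out)


partitions = [
    ["<http://ex.org/a> <http://ex.org/p> <http://ex.org/b> ."],
    [
        "<http://ex.org/c> <http://ex.org/p> <http://ex.org/d> .",
        "<http://ex.org/e> <http://ex.org/q> <http://ex.org/f> .",
    ],
]

with tempfile.TemporaryDirectory() as tmp_dir:
    parts = [write_partition(tmp_dir, i, t) for i, t in enumerate(partitions)]
    output = Path(tmp_dir) / "result.nt"
    merge_partitions(parts, output)
    merged_lines = output.read_text(encoding="utf-8").splitlines()
```

Because every worker writes to its own file, no locking is needed. Writing directly to the final file would require coordinating concurrent writes.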

BUG: Named Graph can't be created

What Happens?

Creating a named graph using YARRRML does not work (https://rml.io/yarrrml/tutorial/getting-started/#how-to-add-triples-to-a-graph).

The error "Found an invalid graph termtype. Found values ['http://www.w3.org/ns/r2rml#Literal']. Graph maps must be http://www.w3.org/ns/r2rml#IRI." appears when using morph-kgc (within kglab).

To Reproduce

test.csv

I am building a YARRRML rules file called rules.yml:

prefixes:
  ex: http://example.com/
  schema: https://schema.org/

mappings:
  TestMapping:
    sources: 
      - [test.csv~csv]
    graph: ex:named_graph # this is the line that breaks the workflow
    s: ex:$(id)
    po:
      - [schema:name, $(name)]
      - [schema:type, $(type)]

I convert this to RML using the yarrrml-parser (yarrrml-parser -i rules.yml -o mapping.rml.ttl). This results in the following mapping.rml.ttl file:

@prefix rr: <http://www.w3.org/ns/r2rml#>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix fnml: <http://semweb.mmlab.be/ns/fnml#>.
@prefix fno: <https://w3id.org/function/ontology#>.
@prefix d2rq: <http://www.wiwiss.fu-berlin.de/suhl/bizer/D2RQ/0.1#>.
@prefix void: <http://rdfs.org/ns/void#>.
@prefix dc: <http://purl.org/dc/terms/>.
@prefix foaf: <http://xmlns.com/foaf/0.1/>.
@prefix rml: <http://semweb.mmlab.be/ns/rml#>.
@prefix ql: <http://semweb.mmlab.be/ns/ql#>.
@prefix : <http://mapping.example.com/>.
@prefix ex: <http://example.com/>.
@prefix schema: <https://schema.org/>.

:rules_000 a void:Dataset;
    void:exampleResource :map_TestMapping_000.
:map_TestMapping_000 rml:logicalSource :source_000.
:source_000 a rml:LogicalSource;
    rml:source "test.csv";
    rml:referenceFormulation ql:CSV.
:map_TestMapping_000 a rr:TriplesMap;
    rdfs:label "TestMapping".
:s_000 a rr:SubjectMap.
:map_TestMapping_000 rr:subjectMap :s_000.
:s_000 rr:template "http://example.com/{id}";
    rr:graphMap :gm_000.
:gm_000 a rr:GraphMap;
    rr:constant "http://example.com/named_graph".
:pom_000 a rr:PredicateObjectMap.
:map_TestMapping_000 rr:predicateObjectMap :pom_000.
:pm_000 a rr:PredicateMap.
:pom_000 rr:predicateMap :pm_000.
:pm_000 rr:constant schema:name.
:pom_000 rr:objectMap :om_000.
:om_000 a rr:ObjectMap;
    rml:reference "name";
    rr:termType rr:Literal.
:pom_001 a rr:PredicateObjectMap.
:map_TestMapping_000 rr:predicateObjectMap :pom_001.
:pm_001 a rr:PredicateMap.
:pom_001 rr:predicateMap :pm_001.
:pm_001 rr:constant schema:type.
:pom_001 rr:objectMap :om_001.
:om_001 a rr:ObjectMap;
    rml:reference "type";
    rr:termType rr:Literal.

I use this small script to create the RDF triples:

import kglab

namespaces = {
    "ex":  "http://example.com/",
    "schema": "https://schema.org/"
    }
kg = kglab.KnowledgeGraph(
    name = "A KG example",
    namespaces = namespaces,
    )
kg.materialize('config.ini')
kg.save_rdf("rdf-triples.ttl")

The config file contains:

[CONFIGURATION]
logging_level=DEBUG

[DataSource1]
mappings=mapping.rml.ttl

Environment (please complete the following information):

  • OS: MacOS 12.2.1
  • Python version: 3.9.10
  • Morph-KGC version: 1.6.0 (kglab 0.4.4)

Test the engine with [R2]RML test cases

Use the test cases from R2RML and RML as unit tests, for ensuring the conformance of the engine with the specifications.

The output should be two different EARL reports. An option could be to use the mappings and rules developed for the RML implementation report: https://github.com/RMLio/rml-implementation-report/blob/master/rules.yarrrml.yml

For RML, only MySQL should be tested at the moment. The rest should appear in the report as http://www.w3.org/ns/earl#inapplicable

Does not find the column value

Describe the bug
Does not find the column value

To Reproduce

  1.  config = """
     [DEFAULT]
     main_dir: ./
     mappings_dir: ./Yarrmlmappings
     
     [CONFIGURATION]
     output_dir = ${main_dir}output
     output_file = result
     
     [DataSource1]
     source_type = JSON
     mappings = ${mappings_dir}/mappingOA.ttl
     """
    graph = morph_kgc.materialize(config)
  2. Mapping
``` yarml
  @prefix rr: <http://www.w3.org/ns/r2rml#>.
  @prefix rml: <http://semweb.mmlab.be/ns/rml#>.
  @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
  @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
  @prefix ql: <http://semweb.mmlab.be/ns/ql#>.
  @prefix map: <http://mapping.example.com/>.
  
  map:jc_000 rr:child "id";
      rr:parent "id".
  map:language_000 rml:reference "language.code".
  map:map_Author_000 rml:logicalSource map:source_001;
      a rr:TriplesMap;
      rdfs:label "Author";
      rr:subjectMap map:s_000;
      rr:predicateObjectMap map:pom_000, map:pom_001, map:pom_002, map:pom_003, map:pom_004.
  map:map_idTypeOpenAire_000 rml:logicalSource map:source_002;
      a rr:TriplesMap;
      rdfs:label "idTypeOpenAire";
      rr:subjectMap map:s_002;
      rr:predicateObjectMap map:pom_020, map:pom_021, map:pom_022.
  map:map_idTypePaperResource_000 rml:logicalSource map:source_003;
      a rr:TriplesMap;
      rdfs:label "idTypePaperResource";
      rr:subjectMap map:s_003;
      rr:predicateObjectMap map:pom_023, map:pom_024, map:pom_025.
  map:map_Paper_000 rml:logicalSource map:source_000;
      a rr:TriplesMap;
      rdfs:label "Paper";
      rr:subjectMap map:s_001;
      rr:predicateObjectMap map:pom_005, map:pom_006, map:pom_007, map:pom_008, map:pom_009, map:pom_010, map:pom_011, map:pom_012, map:pom_013, map:pom_014, map:pom_015, map:pom_016, map:pom_017, map:pom_018, map:pom_019.
  map:om_000 a rr:ObjectMap;
      rr:constant "https://w3id.org/okn/os/o/Author";
      rr:termType rr:IRI.
  map:om_001 a rr:ObjectMap;
      rml:reference "fullname";
      rr:termType rr:Literal.
  map:om_002 a rr:ObjectMap;
      rml:reference "name";
      rr:termType rr:Literal.
  map:om_003 a rr:ObjectMap;
      rml:reference "surname";
      rr:termType rr:Literal.
  map:om_004 a rr:ObjectMap;
      rml:reference "pid.id.value";
      rr:termType rr:Literal.
  map:om_005 a rr:ObjectMap;
      rr:constant "https://w3id.org/okn/os/o/Paper";
      rr:termType rr:IRI.
  map:om_006 a rr:ObjectMap;
      rml:reference "maintitle";
      rr:termType rr:Literal.
  map:om_007 a rr:ObjectMap;
      rml:reference "subtitle";
      rr:termType rr:Literal.
  map:om_008 a rr:ObjectMap;
      rml:reference "description.*";
      rr:termType rr:Literal;
      rml:languageMap map:language_000.
  map:om_009 a rr:ObjectMap;
      rml:reference "language.label";
      rr:termType rr:Literal.
  map:om_010 a rr:ObjectMap;
      rml:reference "format.*";
      rr:termType rr:Literal.
  map:om_011 a rr:ObjectMap;
      rml:reference "publicationdate";
      rr:termType rr:Literal.
  map:om_012 a rr:ObjectMap;
      rml:reference "type";
      rr:termType rr:Literal.
  map:om_013 a rr:ObjectMap;
      rml:reference "country.*.label";
      rr:termType rr:Literal.
  map:om_014 a rr:ObjectMap;
      rml:reference "instance.*.license";
      rr:termType rr:Literal.
  map:om_015 a rr:ObjectMap;
      rml:reference "publisher";
      rr:termType rr:Literal.
  map:om_016 a rr:ObjectMap;
      rml:reference "source";
      rr:termType rr:Literal.
  map:om_017 a rr:ObjectMap;
      rr:template "https://w3id.org/okn/os/i/idType/{pid.*.value}";
      rr:termType rr:IRI.
  map:om_018 a rr:ObjectMap;
      rr:template "https://w3id.org/okn/os/i/author/{author.*.fullname}";
      rr:termType rr:IRI.
  map:om_019 a rr:ObjectMap;
      rr:parentTriplesMap map:map_idTypeOpenAire_000;
      rr:joinCondition map:jc_000.
  map:om_020 a rr:ObjectMap;
      rr:constant "https://w3id.org/okn/os/o/idType";
      rr:termType rr:IRI.
  map:om_021 a rr:ObjectMap;
      rr:constant "OpenAire";
      rr:termType rr:Literal.
  map:om_022 a rr:ObjectMap;
      rml:reference "id";
      rr:termType rr:Literal.
  map:om_023 a rr:ObjectMap;
      rr:constant "https://w3id.org/okn/os/o/idType";
      rr:termType rr:IRI.
  map:om_024 a rr:ObjectMap;
      rml:reference "scheme";
      rr:termType rr:Literal.
  map:om_025 a rr:ObjectMap;
      rml:reference "value";
      rr:termType rr:Literal.
  map:pm_000 a rr:PredicateMap;
      rr:constant rdf:type.
  map:pm_001 a rr:PredicateMap;
      rr:constant <https://w3id.org/okn/os/o/fullname>.
  map:pm_002 a rr:PredicateMap;
      rr:constant <https://w3id.org/okn/os/o/name>.
  map:pm_003 a rr:PredicateMap;
      rr:constant <https://w3id.org/okn/os/o/surname>.
  map:pm_004 a rr:PredicateMap;
      rr:template "https://w3id.org/okn/os/o/{pid.id.scheme}ID".
  map:pm_005 a rr:PredicateMap;
      rr:constant rdf:type.
  map:pm_006 a rr:PredicateMap;
      rr:constant <https://w3id.org/okn/os/o/title>.
  map:pm_007 a rr:PredicateMap;
      rr:constant <https://w3id.org/okn/os/o/subtitle>.
  map:pm_008 a rr:PredicateMap;
      rr:constant <https://w3id.org/okn/os/o/description>.
  map:pm_009 a rr:PredicateMap;
      rr:constant <https://w3id.org/okn/os/o/language>.
  map:pm_010 a rr:PredicateMap;
      rr:constant <https://w3id.org/okn/os/o/format>.
  map:pm_011 a rr:PredicateMap;
      rr:constant <https://w3id.org/okn/os/o/publicationDate>.
  map:pm_012 a rr:PredicateMap;
      rr:constant <https://w3id.org/okn/os/o/type>.
  map:pm_013 a rr:PredicateMap;
      rr:constant <https://w3id.org/okn/os/o/country>.
  map:pm_014 a rr:PredicateMap;
      rr:constant <https://w3id.org/okn/os/o/license>.
  map:pm_015 a rr:PredicateMap;
      rr:constant <https://w3id.org/okn/os/o/publisher>.
  map:pm_016 a rr:PredicateMap;
      rr:constant <https://w3id.org/okn/os/o/source>.
  map:pm_017 a rr:PredicateMap;
      rr:constant <https://w3id.org/okn/os/o/has_id>.
  map:pm_018 a rr:PredicateMap;
      rr:constant <https://w3id.org/okn/os/o/has_id>.
  map:pm_019 a rr:PredicateMap;
      rr:constant <https://w3id.org/okn/os/o/has_id>.
  map:pm_020 a rr:PredicateMap;
      rr:constant rdf:type.
  map:pm_021 a rr:PredicateMap;
      rr:constant <https://w3id.org/okn/os/o/source>.
  map:pm_022 a rr:PredicateMap;
      rr:constant <https://w3id.org/okn/os/o/identifier>.
  map:pm_023 a rr:PredicateMap;
      rr:constant rdf:type.
  map:pm_024 a rr:PredicateMap;
      rr:constant <https://w3id.org/okn/os/o/source>.
  map:pm_025 a rr:PredicateMap;
      rr:constant <https://w3id.org/okn/os/o/identifier>.
  map:pom_000 a rr:PredicateObjectMap;
      rr:predicateMap map:pm_000;
      rr:objectMap map:om_000.
  map:pom_001 a rr:PredicateObjectMap;
      rr:predicateMap map:pm_001;
      rr:objectMap map:om_001.
  map:pom_002 a rr:PredicateObjectMap;
      rr:predicateMap map:pm_002;
      rr:objectMap map:om_002.
  map:pom_003 a rr:PredicateObjectMap;
      rr:predicateMap map:pm_003;
      rr:objectMap map:om_003.
  map:pom_004 a rr:PredicateObjectMap;
      rr:predicateMap map:pm_004;
      rr:objectMap map:om_004.
  map:pom_005 a rr:PredicateObjectMap;
      rr:predicateMap map:pm_005;
      rr:objectMap map:om_005.
  map:pom_006 a rr:PredicateObjectMap;
      rr:predicateMap map:pm_006;
      rr:objectMap map:om_006.
  map:pom_007 a rr:PredicateObjectMap;
      rr:predicateMap map:pm_007;
      rr:objectMap map:om_007.
  map:pom_008 a rr:PredicateObjectMap;
      rr:predicateMap map:pm_008;
      rr:objectMap map:om_008.
  map:pom_009 a rr:PredicateObjectMap;
      rr:predicateMap map:pm_009;
      rr:objectMap map:om_009.
  map:pom_010 a rr:PredicateObjectMap;
      rr:predicateMap map:pm_010;
      rr:objectMap map:om_010.
  map:pom_011 a rr:PredicateObjectMap;
      rr:predicateMap map:pm_011;
      rr:objectMap map:om_011.
  map:pom_012 a rr:PredicateObjectMap;
      rr:predicateMap map:pm_012;
      rr:objectMap map:om_012.
  map:pom_013 a rr:PredicateObjectMap;
      rr:predicateMap map:pm_013;
      rr:objectMap map:om_013.
  map:pom_014 a rr:PredicateObjectMap;
      rr:predicateMap map:pm_014;
      rr:objectMap map:om_014.
  map:pom_015 a rr:PredicateObjectMap;
      rr:predicateMap map:pm_015;
      rr:objectMap map:om_015.
  map:pom_016 a rr:PredicateObjectMap;
      rr:predicateMap map:pm_016;
      rr:objectMap map:om_016.
  map:pom_017 a rr:PredicateObjectMap;
      rr:predicateMap map:pm_017;
      rr:objectMap map:om_017.
  map:pom_018 a rr:PredicateObjectMap;
      rr:predicateMap map:pm_018;
      rr:objectMap map:om_018.
  map:pom_019 a rr:PredicateObjectMap;
      rr:predicateMap map:pm_019;
      rr:objectMap map:om_019.
  map:pom_020 a rr:PredicateObjectMap;
      rr:predicateMap map:pm_020;
      rr:objectMap map:om_020.
  map:pom_021 a rr:PredicateObjectMap;
      rr:predicateMap map:pm_021;
      rr:objectMap map:om_021.
  map:pom_022 a rr:PredicateObjectMap;
      rr:predicateMap map:pm_022;
      rr:objectMap map:om_022.
  map:pom_023 a rr:PredicateObjectMap;
      rr:predicateMap map:pm_023;
      rr:objectMap map:om_023.
  map:pom_024 a rr:PredicateObjectMap;
      rr:predicateMap map:pm_024;
      rr:objectMap map:om_024.
  map:pom_025 a rr:PredicateObjectMap;
      rr:predicateMap map:pm_025;
      rr:objectMap map:om_025.
  map:rules_000 a <http://rdfs.org/ns/void#Dataset>;
      <http://rdfs.org/ns/void#exampleResource> map:map_Author_000, map:map_Paper_000, map:map_idTypeOpenAire_000, map:map_idTypePaperResource_000.
  map:s_000 a rr:SubjectMap;
      rr:template "https://w3id.org/okn/os/i/author/{fullname}".
  map:s_001 a rr:SubjectMap;
      rr:template "https://w3id.org/okn/os/i/paper/{id}".
  map:s_002 a rr:SubjectMap;
      rr:template "https://w3id.org/okn/os/i/idType/{id}".
  map:s_003 a rr:SubjectMap;
      rr:template "https://w3id.org/okn/os/i/idType/{value}".
  map:source_000 a rml:LogicalSource;
      rdfs:label "main-source";
      rml:source "data.json";
      rml:iterator "$.*";
      rml:referenceFormulation ql:JSONPath.
  map:source_001 a rml:LogicalSource;
      rdfs:label "author-source";
      rml:source "data.json";
      rml:iterator "$.*.author[*]";
      rml:referenceFormulation ql:JSONPath.
  map:source_002 a rml:LogicalSource;
      rdfs:label "pid-source";
      rml:source "data.json";
      rml:iterator "$.*.pid[*]";
      rml:referenceFormulation ql:JSONPath.
  map:source_003 a rml:LogicalSource;
      rml:source "data.json";
      rml:iterator "$.*.pid[*]";
      rml:referenceFormulation ql:JSONPath.

  1. ERROR
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/tmp/ipykernel_165859/1161552163.py in <module>
----> 1 graph = morph_kgc.materialize(config)

~/OpenScience-Public/venv/lib/python3.8/site-packages/morph_kgc-1.3.4-py3.8.egg/morph_kgc/__init__.py in materialize(config)
     33         triples = set()
     34         for i, mapping_rule in mapping_partition.iterrows():
---> 35             triples.update(set(_materialize_mapping_rule(mapping_rule, subject_maps_df, config)))
     36 
     37             logging.debug(str(len(triples)) + ' triples generated for mapping rule `' + str(mapping_rule['id']) + '`.')

~/OpenScience-Public/venv/lib/python3.8/site-packages/morph_kgc-1.3.4-py3.8.egg/morph_kgc/materializer.py in _materialize_mapping_rule(mapping_rule, subject_maps_df, config)
    270             result_chunks = get_sql_data(config, mapping_rule, references)
    271         elif mapping_rule['source_type'] in FILE_SOURCE_TYPES:
--> 272             result_chunks = get_file_data(config, mapping_rule, references)
    273 
    274         for query_results_chunk_df in result_chunks:

~/OpenScience-Public/venv/lib/python3.8/site-packages/morph_kgc-1.3.4-py3.8.egg/morph_kgc/data_source/data_file.py in get_file_data(config, mapping_rule, references)
     35         return _read_spss(mapping_rule, references)
     36     elif file_source_type == JSON:
---> 37         return _read_json(mapping_rule, references)
     38     elif file_source_type == XML:
     39         return _read_xml(mapping_rule, references)

~/OpenScience-Public/venv/lib/python3.8/site-packages/morph_kgc-1.3.4-py3.8.egg/morph_kgc/data_source/data_file.py in _read_json(mapping_rule, references)
    136     json_df = pd.DataFrame.from_records(jsonpath_result)
    137 
--> 138     json_df = json_df[references]
    139 
    140     return [json_df]

~/OpenScience-Public/venv/lib/python3.8/site-packages/pandas/core/frame.py in __getitem__(self, key)
   3462             if is_iterator(key):
   3463                 key = list(key)
-> 3464             indexer = self.loc._get_listlike_indexer(key, axis=1)[1]
   3465 
   3466         # take() does not accept boolean indexers

~/OpenScience-Public/venv/lib/python3.8/site-packages/pandas/core/indexing.py in _get_listlike_indexer(self, key, axis)
   1312             keyarr, indexer, new_indexer = ax._reindex_non_unique(keyarr)
   1313 
-> 1314         self._validate_read_indexer(keyarr, indexer, axis)
   1315 
   1316         if needs_i8_conversion(ax.dtype) or isinstance(

~/OpenScience-Public/venv/lib/python3.8/site-packages/pandas/core/indexing.py in _validate_read_indexer(self, key, indexer, axis)
   1372                 if use_interval_msg:
   1373                     key = list(key)
-> 1374                 raise KeyError(f"None of [{key}] are in the [{axis_name}]")
   1375 
   1376             not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())

KeyError: "None of [Index(['fullname'], dtype='object')] are in the [columns]"
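The last Morph-KGC frame in the traceback is `json_df = json_df[references]`. This suggests that none of the records returned by the iterator contained a `fullname` key, so the DataFrame built from them had no such column. A minimal sketch of a pre-check that reports this before the column selection is shown below. The helper `missing_references` and the sample records are hypothetical and not part of Morph-KGC.

```python
def missing_references(records, references):
    """Return the references that do not appear as a key in any record."""
    seen = set()
    for record in records:
        if isinstance(record, dict):
            seen.update(record.keys())
    return [ref for ref in references if ref not in seen]


# Hypothetical iterator output: the author objects carry no "fullname" key.
author_records = [{"name": "Ada", "surname": "Lovelace"}, {"name": "Alan"}]

missing = missing_references(author_records, ["fullname"])
print(missing)
```

Running a check like this on the objects selected by the iterator may show whether the reference name or the iterator path needs to change.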

Additional context
Data will be shared privately :)

Support data errors

Notify the user when data errors occur. Since this check might be expensive, provide a configuration option to disable it.

QST: rml:iterator for JSONPath

I am evaluating Morph-KGC for transforming a hierarchical JSON file into RDF triples according to a custom ontology. I want to create a mapping whose logical source uses a JSONPath expression that returns every JSON object whose "actor_type" property has the value "Person", regardless of the depth at which it appears. I have written the following logical source:

<#GenericPersonMapping>
    a rr:TriplesMap;

    rml:logicalSource [
        rml:source "data.json";
        rml:referenceFormulation ql:JSONPath;                        
        rml:iterator "$..*.[?(@.actor_type ==\"Person\")";                     
    ];
      
    # Class ms:Person
    rr:subjectMap [ 
        a rr:Subject;
        rr:template "http://www.example.com/person/{pk}";
        rr:termType rr:IRI;
        rr:class ms:Person;
    ];

The problem is that it does not seem to generate any triples.
I have tested the pattern at https://jsonpath.curiousconcept.com with the same input and it seems to return the expected result. Any ideas what could be going wrong?
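One way to confirm which objects the iterator is expected to select, independently of any JSONPath engine, is a small recursive walk in plain Python. The sketch below is only an illustration, not how Morph-KGC evaluates iterators. The sample document and the helper name `find_objects` are made up. Note also that the iterator as written above opens a filter with `[?(` but never closes it with `]`, which stricter JSONPath implementations may reject.

```python
import json


def find_objects(node, key, value):
    """Yield every dict, at any depth, whose `key` equals `value`."""
    if isinstance(node, dict):
        if node.get(key) == value:
            yield node
        for child in node.values():
            yield from find_objects(child, key, value)
    elif isinstance(node, list):
        for child in node:
            yield from find_objects(child, key, value)


# Hypothetical input with "Person" objects at two different depths.
sample = json.loads("""
{
  "pk": 1, "actor_type": "Organization",
  "members": [
    {"pk": 2, "actor_type": "Person"},
    {"pk": 3, "actor_type": "Group",
     "members": [{"pk": 4, "actor_type": "Person"}]}
  ]
}
""")

people = list(find_objects(sample, "actor_type", "Person"))
person_pks = [p["pk"] for p in people]
print(person_pks)
```

Comparing this list with the subjects produced by the mapping can help narrow down whether the iterator or a later step is dropping the records.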
