rdflib-hdt's Introduction

RDFLib

RDFLib is a pure Python package for working with RDF. RDFLib contains most things you need to work with RDF, including:

  • parsers and serializers for RDF/XML, N3, NTriples, N-Quads, Turtle, TriX, Trig and JSON-LD
  • a Graph interface which can be backed by any one of a number of Store implementations
  • store implementations for in-memory, persistent on disk (Berkeley DB) and remote SPARQL endpoints
  • a SPARQL 1.1 implementation - supporting SPARQL 1.1 Queries and Update statements
  • SPARQL function extension mechanisms

RDFLib Family of packages

The RDFLib community maintains many RDF-related Python code repositories with different purposes. For example:

  • rdflib - the RDFLib core
  • sparqlwrapper - a simple Python wrapper around a SPARQL service to remotely execute your queries
  • pyLODE - An OWL ontology documentation tool using Python and templating, based on LODE.
  • pyrdfa3 - RDFa 1.1 distiller/parser library: can extract RDFa 1.1/1.0 from (X)HTML, SVG, or XML in general.
  • pymicrodata - A module to extract RDF from an HTML5 page annotated with microdata.
  • pySHACL - A pure Python module which allows for the validation of RDF graphs against SHACL graphs.
  • OWL-RL - A simple implementation of the OWL2 RL Profile which expands the graph with all possible triples that OWL RL defines.

Please see the full list of packages/repositories at https://github.com/RDFLib/.

Help with maintenance of all of the RDFLib family of packages is always welcome and appreciated.

Versions & Releases

See https://rdflib.dev for the release overview.

Documentation

See https://rdflib.readthedocs.io for our documentation, built from the code. Note that documentation is published for the latest and stable versions, as well as for the 5.0.0 and 4.2.2 releases.

Installation

The stable release of RDFLib may be installed with Python's package management tool pip:

$ pip install rdflib

Some features of RDFLib require optional dependencies which may be installed using pip extras:

$ pip install rdflib[berkeleydb,networkx,html,lxml]

Alternatively, manually download the package from the Python Package Index (PyPI) at https://pypi.python.org/pypi/rdflib

The current version of RDFLib is 7.0.0; see the CHANGELOG.md file for what's new in this release.
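
To quickly check which version is installed (a simple sanity check, not from the official docs):

$ python -c "import rdflib; print(rdflib.__version__)"

This should print the installed version, e.g. 7.0.0.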

Installation of the current main branch (for developers)

With pip you can also install rdflib from the git repository with one of the following options:

$ pip install git+https://github.com/rdflib/rdflib@main

or

$ pip install -e git+https://github.com/rdflib/rdflib@main#egg=rdflib

or from your locally cloned repository you can install it with one of the following options:

$ poetry install  # installs into a poetry-managed venv

or

$ pip install -e .

Getting Started

RDFLib aims to be a pythonic RDF API. RDFLib's main data object is a Graph, which is a Python collection of RDF Subject, Predicate, Object triples.

To create a graph, load it with RDF data from DBpedia, and then print the results:

from rdflib import Graph
g = Graph()
g.parse('http://dbpedia.org/resource/Semantic_Web')

for s, p, o in g:
    print(s, p, o)

The components of the triples are URIs (resources) or Literals (values).

URIs are grouped together by namespace; common namespaces are included in RDFLib:

from rdflib.namespace import DC, DCTERMS, DOAP, FOAF, SKOS, OWL, RDF, RDFS, VOID, XMLNS, XSD

You can use them like this:

from rdflib import Graph, URIRef, Literal
from rdflib.namespace import FOAF, RDFS, XSD

g = Graph()
semweb = URIRef('http://dbpedia.org/resource/Semantic_Web')
label = g.value(semweb, RDFS.label)

Where RDFS is the RDFS namespace, XSD the XML Schema Datatypes namespace, and g.value returns an object matching the given triple pattern (or an arbitrary one if multiple exist).

Or like this, adding a triple to a graph g:

g.add((
    URIRef("http://example.com/person/nick"),
    FOAF.givenName,
    Literal("Nick", datatype=XSD.string)
))

This creates the triple (in N-Triples notation) <http://example.com/person/nick> <http://xmlns.com/foaf/0.1/givenName> "Nick"^^<http://www.w3.org/2001/XMLSchema#string> ., where the property FOAF.givenName is the URI <http://xmlns.com/foaf/0.1/givenName> and XSD.string is the URI <http://www.w3.org/2001/XMLSchema#string>.

You can bind namespaces to prefixes to shorten the URIs for RDF/XML, Turtle, N3, TriG, TriX & JSON-LD serializations:

g.bind("foaf", FOAF)
g.bind("xsd", XSD)

This will allow the N-Triples triple above to be serialized like this:

print(g.serialize(format="turtle"))

With these results:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

<http://example.com/person/nick> foaf:givenName "Nick"^^xsd:string .

New Namespaces can also be defined:

from rdflib import Namespace

dbpedia = Namespace('http://dbpedia.org/ontology/')

abstracts = list(x for x in g.objects(semweb, dbpedia['abstract']) if x.language == 'en')

See also ./examples

Features

The library contains parsers and serializers for RDF/XML, N3, NTriples, N-Quads, Turtle, TriX, JSON-LD, RDFa and Microdata.

The library presents a Graph interface which can be backed by any one of a number of Store implementations.

This core RDFLib package includes store implementations for in-memory storage and persistent storage on top of Berkeley DB.

A SPARQL 1.1 implementation is included - supporting SPARQL 1.1 Queries and Update statements.
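
For a brief illustration of querying (a minimal sketch; the DBpedia resource and the rdfs:label property are only examples):

from rdflib import Graph

g = Graph()
g.parse('http://dbpedia.org/resource/Semantic_Web')

# SELECT queries return rows of bound variables
results = g.query("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?label WHERE {
        <http://dbpedia.org/resource/Semantic_Web> rdfs:label ?label .
        FILTER (lang(?label) = 'en')
    }
""")
for row in results:
    print(row.label)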

RDFLib is open source and is maintained on GitHub. RDFLib releases, current and previous, are listed on PyPI.

Multiple other projects are contained within the RDFlib "family", see https://github.com/RDFLib/.

Running tests

Running the tests on the host

Run the test suite with pytest.

poetry install
poetry run pytest

Running test coverage on the host with coverage report

Run the test suite and generate an HTML coverage report with pytest and pytest-cov.

poetry run pytest --cov

Viewing test coverage

Once tests have produced HTML output of the coverage report, view it by running:

poetry run pytest --cov --cov-report term --cov-report html
python -m http.server --directory=htmlcov

Contributing

RDFLib survives and grows via user contributions! Please read our contributing guide and developers guide to get started. Please consider lodging Pull Requests at https://github.com/RDFLib/rdflib/pulls

To get a development environment, consider using Gitpod or Google Cloud Shell.

You can also raise issues at https://github.com/RDFLib/rdflib/issues

Support & Contacts

For general "how do I..." questions, please use https://stackoverflow.com and tag your question with rdflib. Existing questions: https://stackoverflow.com/questions/tagged/rdflib

If you want to contact the rdflib maintainers, please do so via the rdflib-dev mailing list or the Gitter/Matrix chat.

rdflib-hdt's Issues

rdflib_hdt.optimize_sparql() produces incorrect results

Describe the bug

When querying in SPARQL mode, a query statement for which no variable binding can be found will bind to all triples in the graph. It should bind to none. This behavior appears after calling rdflib_hdt.optimize_sparql().

To Reproduce
Steps to reproduce the behavior:

  1. Unzip all.hdt.zip
  2. Run python3 hdt-test.py

Expected behavior
The last query should not return any results

System:

  • OS: Ubuntu 20.04
  • Python 3.8.10
  • hdt 2.3
  • rdflib 6.0.1

Windows Installation

Hi,

Is it compatible with Windows? I have tried to install it through pip on Windows 10 but I got some errors.

Best regards
Sirui

How to avoid high memory consumption?

I am having problems with queries being killed due to exceeding the available memory on my machine. The documentation says

Missing indexes are generated automatically, but be careful, as it requires to load all HDT triples in memory!

so I first thought that I might be running into this. But thinking about it, I don't see how a 70M HDT file could reasonably use up more than 10 GB of RAM when loaded into memory. Is there a way to deal with such a situation other than just writing simpler queries? Is there any way, as a user, to find out where the problem comes from? (In an SQL query plan I would look for things like large sort buffers.)

The query I'm using is

    SELECT ?lexentry ?other_form
    WHERE {
        ?lexentry a ontolex:LexicalEntry ;
                  ontolex:otherForm ?other_form .

        ?other_form ontolex:writtenRep ?other_written .
        OPTIONAL { ?lexentry lexinfo:partOfSpeech ?pos }

        OPTIONAL { ?other_form olia:hasMood ?mood }
        OPTIONAL { ?other_form olia:hasNumber ?number }
        OPTIONAL { ?other_form olia:hasPerson ?person }
        OPTIONAL { ?other_form olia:hasTense ?tense }
        OPTIONAL { ?other_form olia:hasVoice ?voice }

        OPTIONAL { ?other_form olia:hasCase ?case }
        OPTIONAL { ?other_form olia:hasInflectionType ?inflection }
        OPTIONAL { ?other_form olia:hasDefiniteness ?definiteness }
        OPTIONAL { ?other_form olia:hasGender ?gender }
    }

I can provide exact code and data to reproduce this if it is of any help. I have executed the same query successfully on the same dataset in Virtuoso.

Versions used:

Python 3.9.2
rdflib        6.2.0
rdflib-hdt    3.0

Querying multiple HDT graphs

This is more of a question: is it possible to query with SPARQL over more than one HDT file, or over an HDT file and a regular RDF graph? The points below apply only under the assumption that it is not (I didn't figure out how).

Is your feature request related to a problem? Please describe.
We use the WordNet HDT and several other HDT graphs with bilingual dictionaries. We want to query WordNet + dictionaries to get (possible) synsets from WordNet (HDT1) for (say) Spanish words via a Spanish-English dictionary (HDT2).

Describe the solution you'd like
Treat multiple HDT files as single graphs and access them via SPARQL GRAPH (FROM, USING, etc.).
The downside is that HDT graphs are read-only (I guess), and if HDT graphs are freely mixed with writable graphs, users may be tempted to write into them and get frustrated by the results.

Describe alternatives you've considered
Overload the SERVICE keyword, i.e.

  • If an HDT file is read, allow users to assign it an identifier (let's call that "service URI")
  • When evaluating SPARQL queries, if a SERVICE is invoked, check whether it's a pre-registered service URI; if so, return the results of SELECT * {...} from the HDT file, otherwise evaluate using the standard implementation for SERVICE
  • This could probably be done by means of a SPARQL extension

In terms of ease of implementation and user experience, this may be the preferred solution, but I feel this is a bit of a hack and it will produce SPARQL queries that can probably not be ported to other HDT implementations (unless they adopt the same strategy).

Alternatives that don't apply
In theory, we could actually use the standard implementation of SERVICE, but that would require setting up one endpoint per HDT file; it would slow things down and create considerable overhead in both coding and communication. This may not be much of an issue if we're consulting just two HDT files, but we have plans to do that on a massively multilingual scale, so we might end up with dozens or hundreds of HDT files per query.

Additional context
That seems to be a feature of the Jena integration.
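
One more possible direction, as a sketch only (untested with HDT stores; the file names are hypothetical): rdflib can aggregate several graphs into a single read-only view, which matches HDT's read-only nature:

from rdflib import Graph
from rdflib.graph import ReadOnlyGraphAggregate
from rdflib_hdt import HDTStore

# Each HDT file backs its own read-only graph (hypothetical file names)
wordnet = Graph(store=HDTStore("wordnet.hdt"))
dictionary = Graph(store=HDTStore("es-en.hdt"))

# The aggregate exposes the union of both graphs for querying
combined = ReadOnlyGraphAggregate([wordnet, dictionary])
for row in combined.query("SELECT * WHERE { ?s ?p ?o } LIMIT 10"):
    print(row)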

Define a sparqleval plugin for BGP query opt

There is a recommendation to call optimize_sparql before doing SPARQL queries, which "monkey-patches" the evalBGP method in RDFLib to use a more efficient implementation for HDT. This approach has the disadvantage that it requires an additional step from the user of the module where one should not be needed.

There is a setuptools "entry point" defined for RDFLib SPARQL eval, called rdf.plugins.sparqleval, which should allow for a similar substitution without requiring a user of HDT to add the optimize_sparql call to their program initialization. If rdflib-hdt is meant to work in environments without setuptools/pkg_resources machinery, then the optimize_sparql method can still be modified to add to the CUSTOM_EVALS dictionary rather than modify evalBGP in rdflib.plugins.sparql, which has the same effect as defining the entry point.
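
A minimal sketch of that CUSTOM_EVALS route (the hdt_eval_bgp helper is hypothetical, standing in for rdflib-hdt's optimized BGP evaluation; only the registration mechanism is shown):

import rdflib.plugins.sparql as sparql

def hdt_eval(ctx, part):
    # Handle Basic Graph Patterns with the HDT-optimized evaluator
    if part.name == 'BGP':
        return hdt_eval_bgp(ctx, part.triples)  # hypothetical optimized evaluator
    # Any other algebra node falls back to rdflib's default evaluation
    raise NotImplementedError()

# Registering here has the same effect as defining the entry point
sparql.CUSTOM_EVALS['hdt'] = hdt_eval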

Can't install via pip

I tried to install the lib on Windows 10 via pip install rdflib-hdt.
I have Python 3.7 and the new MS Build Tools with C++.
While installing, an error occurred:

hdt-cpp-1.3.3/libcds/include/libcdsBasics.h(27): fatal error C1083: Cannot open include file: 'sys/resource.h': No such file or directory
error: command 'C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.26.28801\bin\HostX86\x64\cl.exe' failed with exit status 2

Why did this error happen, and how can I solve it?
I am not a C++ expert, so I have no idea why there is no file 'sys/resource.h'.

Tests are not running: E AttributeError: can't set attribute

Hi guys, thanks for working on this. RDFLib is great!

To Reproduce
Follow the install instructions (install requirements etc.)

Run the tests: python -m pytest tests

Expected behavior
Tests should pass

Desktop (please complete the following information):

  • OS: macOS, pvenv, Python 3.9

============================= test session starts ==============================
platform darwin -- Python 3.9.9, pytest-6.2.5, py-1.11.0, pluggy-1.0.0
rootdir: /workdir/RDFLib/rdflib-hdt
collected 29 items / 2 errors / 27 selected

==================================== ERRORS ====================================
___________________ ERROR collecting tests/hdt_store_test.py ___________________
tests/hdt_store_test.py:4: in <module>
    from rdflib_hdt import HDTStore, optimize_sparql
rdflib_hdt/__init__.py:12: in <module>
    from rdflib_hdt.sparql_op import optimize_sparql
rdflib_hdt/sparql_op.py:6: in <module>
    import rdflib.plugins.sparql.evaluate as sparql_evaluate
pyhdt_venv/lib/python3.9/site-packages/rdflib/plugins/sparql/__init__.py:35: in <module>
    from . import parser
pyhdt_venv/lib/python3.9/site-packages/rdflib/plugins/sparql/parser.py:182: in <module>
    Param('prefix', PN_PREFIX)) + Suppress(':').leaveWhitespace()
pyhdt_venv/lib/python3.9/site-packages/rdflib/plugins/sparql/parserutils.py:113: in __init__
    self.name = name
E   AttributeError: can't set attribute
___________________ ERROR collecting tests/wrappers_test.py ____________________
tests/wrappers_test.py:3: in <module>
    from rdflib_hdt import HDTDocument
rdflib_hdt/__init__.py:12: in <module>
    from rdflib_hdt.sparql_op import optimize_sparql
rdflib_hdt/sparql_op.py:6: in <module>
    import rdflib.plugins.sparql.evaluate as sparql_evaluate
pyhdt_venv/lib/python3.9/site-packages/rdflib/plugins/sparql/__init__.py:35: in <module>
    from . import parser
pyhdt_venv/lib/python3.9/site-packages/rdflib/plugins/sparql/parser.py:182: in <module>
    Param('prefix', PN_PREFIX)) + Suppress(':').leaveWhitespace()
pyhdt_venv/lib/python3.9/site-packages/rdflib/plugins/sparql/parserutils.py:113: in __init__
    self.name = name
E   AttributeError: can't set attribute
=============================== warnings summary ===============================
pyhdt_venv/lib/python3.9/site-packages/rdflib/plugins/sparql/compat.py:8
pyhdt_venv/lib/python3.9/site-packages/rdflib/plugins/sparql/compat.py:8
  /workdir/RDFLib/rdflib-hdt/pyhdt_venv/lib/python3.9/site-packages/rdflib/plugins/sparql/compat.py:8: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.10 it will stop working
    from collections import Mapping, MutableMapping  # was added in 2.6

-- Docs: https://docs.pytest.org/en/stable/warnings.html
=========================== short test summary info ============================
ERROR tests/hdt_store_test.py - AttributeError: can't set attribute
ERROR tests/wrappers_test.py - AttributeError: can't set attribute
!!!!!!!!!!!!!!!!!!! Interrupted: 2 errors during collection !!!!!!!!!!!!!!!!!!!!

HDTStore.triples incorrectly has 'context' as positional argument

Describe the bug
When looping over the triples in an HDTStore object using the triples(pattern, context) method, the context argument cannot be left out or given as None. The description in README.md indicates that the context should be optional. Moreover, requiring only the pattern is the expected behaviour of regular RDFLib graph stores.

To Reproduce

RDFLib Version: 5.0.0

from rdflib import Graph
from rdflib_hdt import HDTStore

g = HDTStore("data.hdt")
for s, p, o in g.triples((None, None, None)):
    continue

TypeError: triples() missing 1 required positional argument: 'context'

Expected behavior
The above code should run without error, i.e., 'context' should be optional. Note that the following code, which is executed under the hood, works fine:

for s, p, o in g.hdt_document.search((None, None, None))[0]:
    continue

Filter triples

Hi,

How can I get triples whose predicates start with a given string? For example:

triples_sub, cardinality_sub = document.search_triples(entity, "http://www.wikidata.org/prop/direct/{}", "")

to get triples whose subject is entity and whose predicate starts with "http://www.wikidata.org/prop/direct/".
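
A possible workaround, as an untested sketch (HDT pattern search matches exact terms only, so prefix filtering has to happen on the client side):

# Leave the predicate unbound ("" is the wildcard) and filter in Python
prefix = "http://www.wikidata.org/prop/direct/"
triples, cardinality = document.search_triples(entity, "", "")
filtered = [(s, p, o) for s, p, o in triples if p.startswith(prefix)]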

Add inference support (e.g., RDFS)

It would be useful to implement inference support that leverages the advantages of HDT indexes.

The rdflib package supports custom evaluation functions as documented at https://rdflib.readthedocs.io/en/stable/intro_to_sparql.html#custom-evaluation-functions and https://rdflib.readthedocs.io/en/stable/_modules/examples/custom_eval.html#customEval. However, a custom implementation of standard RDFS can be slow and miss opportunities for optimizations that HDT could provide.

Note that Issue #14 and the existing optimize_sparql() may be related.
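
One existing route, sketched with the OWL-RL package from the RDFLib family (untested against HDT; since HDT-backed stores are read-only, inferred triples would have to be materialized into a separate writable graph):

from rdflib import Graph
from owlrl import DeductiveClosure, RDFS_Semantics

# Copy the triples into a writable in-memory graph first (HDT is read-only)
g = Graph()
g.parse("data.ttl")  # hypothetical input data
DeductiveClosure(RDFS_Semantics).expand(g)  # adds all RDFS-entailed triples in place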
