rdflib-hdt's Introduction

RDFLib

RDFLib is a pure Python package for working with RDF. RDFLib contains most things you need to work with RDF, including:

  • parsers and serializers for RDF/XML, N3, NTriples, N-Quads, Turtle, TriX, Trig and JSON-LD
  • a Graph interface which can be backed by any one of a number of Store implementations
  • store implementations for in-memory, persistent on disk (Berkeley DB) and remote SPARQL endpoints
  • a SPARQL 1.1 implementation - supporting SPARQL 1.1 Queries and Update statements
  • SPARQL function extension mechanisms

RDFLib Family of packages

The RDFLib community maintains many RDF-related Python code repositories with different purposes. For example:

  • rdflib - the RDFLib core
  • sparqlwrapper - a simple Python wrapper around a SPARQL service to remotely execute your queries
  • pyLODE - An OWL ontology documentation tool using Python and templating, based on LODE.
  • pyrdfa3 - RDFa 1.1 distiller/parser library: can extract RDFa 1.1/1.0 from (X)HTML, SVG, or XML in general.
  • pymicrodata - A module to extract RDF from an HTML5 page annotated with microdata.
  • pySHACL - A pure Python module which allows for the validation of RDF graphs against SHACL graphs.
  • OWL-RL - A simple implementation of the OWL2 RL Profile which expands the graph with all possible triples that OWL RL defines.

Please see the full list of packages/repositories at https://github.com/RDFLib/.

Help with maintenance of all of the RDFLib family of packages is always welcome and appreciated.

Versions & Releases

See https://rdflib.dev for the release overview.

Documentation

See https://rdflib.readthedocs.io for our documentation, built from the code. Note that documentation is published for the latest and stable versions, as well as for the 5.0.0 and 4.2.2 releases.

Installation

The stable release of RDFLib may be installed with Python's package management tool pip:

$ pip install rdflib

Some features of RDFLib require optional dependencies which may be installed using pip extras:

$ pip install rdflib[berkeleydb,networkx,html,lxml]

Alternatively, manually download the package from the Python Package Index (PyPI) at https://pypi.python.org/pypi/rdflib

The current version of RDFLib is 7.0.0; see the CHANGELOG.md file for what's new in this release.
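
To quickly check which version is installed (a simple sanity check, not from the official docs):

$ python -c "import rdflib; print(rdflib.__version__)"

This should print the installed version, e.g. 7.0.0.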

Installation of the current main branch (for developers)

With pip you can also install rdflib from the git repository with one of the following options:

$ pip install git+https://github.com/rdflib/rdflib@main

or

$ pip install -e git+https://github.com/rdflib/rdflib@main#egg=rdflib

or from your locally cloned repository you can install it with one of the following options:

$ poetry install  # installs into a poetry-managed venv

or

$ pip install -e .

Getting Started

RDFLib aims to be a pythonic RDF API. RDFLib's main data object is a Graph, which is a Python collection of RDF Subject, Predicate, Object triples.

To create a graph, load it with RDF data from DBpedia, and then print the results:

from rdflib import Graph
g = Graph()
g.parse('http://dbpedia.org/resource/Semantic_Web')

for s, p, o in g:
    print(s, p, o)

The components of the triples are URIs (resources) or Literals (values).

URIs are grouped together by namespace; common namespaces are included in RDFLib:

from rdflib.namespace import DC, DCTERMS, DOAP, FOAF, SKOS, OWL, RDF, RDFS, VOID, XMLNS, XSD

You can use them like this:

from rdflib import Graph, URIRef, Literal
from rdflib.namespace import FOAF, RDFS, XSD

g = Graph()
semweb = URIRef('http://dbpedia.org/resource/Semantic_Web')
label = g.value(semweb, RDFS.label)

Where RDFS is the RDFS namespace, XSD the XML Schema Datatypes namespace, and g.value returns an object matching the given triple pattern (or an arbitrary one if multiple exist).

Or like this, adding a triple to a graph g:

g.add((
    URIRef("http://example.com/person/nick"),
    FOAF.givenName,
    Literal("Nick", datatype=XSD.string)
))

This creates the triple (in N-Triples notation) <http://example.com/person/nick> <http://xmlns.com/foaf/0.1/givenName> "Nick"^^<http://www.w3.org/2001/XMLSchema#string> ., where the property FOAF.givenName is the URI <http://xmlns.com/foaf/0.1/givenName> and XSD.string is the URI <http://www.w3.org/2001/XMLSchema#string>.

You can bind namespaces to prefixes to shorten the URIs for RDF/XML, Turtle, N3, TriG, TriX & JSON-LD serializations:

g.bind("foaf", FOAF)
g.bind("xsd", XSD)

This will allow the N-Triples triple above to be serialized like this:

print(g.serialize(format="turtle"))

With these results:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

<http://example.com/person/nick> foaf:givenName "Nick"^^xsd:string .

New Namespaces can also be defined:

from rdflib import Namespace

dbpedia = Namespace('http://dbpedia.org/ontology/')

abstracts = list(x for x in g.objects(semweb, dbpedia['abstract']) if x.language == 'en')

See also ./examples

Features

The library contains parsers and serializers for RDF/XML, N3, NTriples, N-Quads, Turtle, TriX, JSON-LD, RDFa and Microdata.

The library presents a Graph interface which can be backed by any one of a number of Store implementations.

This core RDFLib package includes store implementations for in-memory storage and persistent storage on top of Berkeley DB.

A SPARQL 1.1 implementation is included - supporting SPARQL 1.1 Queries and Update statements.
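
For a brief illustration of querying (a minimal sketch; the DBpedia resource and the rdfs:label property are only examples):

from rdflib import Graph

g = Graph()
g.parse('http://dbpedia.org/resource/Semantic_Web')

# SELECT queries return rows of bound variables
results = g.query("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?label WHERE {
        <http://dbpedia.org/resource/Semantic_Web> rdfs:label ?label .
        FILTER (lang(?label) = 'en')
    }
""")
for row in results:
    print(row.label)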

RDFLib is open source and is maintained on GitHub. RDFLib releases, current and previous, are listed on PyPI.

Multiple other projects are contained within the RDFlib "family", see https://github.com/RDFLib/.

Running tests

Running the tests on the host

Run the test suite with pytest.

poetry install
poetry run pytest

Running test coverage on the host with coverage report

Run the test suite and generate an HTML coverage report with pytest and pytest-cov.

poetry run pytest --cov

Viewing test coverage

Once tests have produced HTML output of the coverage report, view it by running:

poetry run pytest --cov --cov-report term --cov-report html
python -m http.server --directory=htmlcov

Contributing

RDFLib survives and grows via user contributions! Please read our contributing guide and developers guide to get started. Please consider lodging Pull Requests at https://github.com/RDFLib/rdflib/pulls

To get a development environment, consider using Gitpod or Google Cloud Shell.

You can also raise issues at https://github.com/RDFLib/rdflib/issues

Support & Contacts

For general "how do I..." questions, please use https://stackoverflow.com and tag your question with rdflib. Existing questions: https://stackoverflow.com/questions/tagged/rdflib

If you want to contact the rdflib maintainers, please do so via the rdflib-dev mailing list or the Gitter/Matrix chat.

rdflib-hdt's Issues

rdflib_hdt.optimize_sparql() produces incorrect results

Describe the bug

When querying in SPARQL mode, a query statement for which no variable binding can be found will bind to all triples in the graph. It should bind to none. This behavior appears after calling rdflib_hdt.optimize_sparql().

To Reproduce
Steps to reproduce the behavior:

  1. Unzip all.hdt.zip
  2. Run python3 hdt-test.py

Expected behavior
The last query should not return any results

System:

  • OS: Ubuntu 20.04
  • Python 3.8.10
  • hdt 2.3
  • rdflib 6.0.1

Windows Installation

Hi,

Is it compatible with Windows? I have tried to install it through pip on Windows 10 but I got some errors.

Best regards
Sirui

How to avoid high memory consumption?

I am having problems with queries being killed due to exceeding the available memory on my machine. The documentation says

Missing indexes are generated automatically, but be careful, as it requires to load all HDT triples in memory!

so I first thought that I might be running into this. But thinking about it, I don't see how a 70M HDT file could reasonably use up more than 10 GB of RAM when loaded into memory. Is there a way to deal with such a situation other than just writing simpler queries? Is there any way, as a user, to find out where the problem comes from? (In an SQL query plan I would look for things like large sort buffers.)

The query I'm using is

    SELECT ?lexentry ?other_form
    WHERE {
        ?lexentry a ontolex:LexicalEntry ;
                  ontolex:otherForm ?other_form .

        ?other_form ontolex:writtenRep ?other_written .
        OPTIONAL { ?lexentry lexinfo:partOfSpeech ?pos }

        OPTIONAL { ?other_form olia:hasMood ?mood }
        OPTIONAL { ?other_form olia:hasNumber ?number }
        OPTIONAL { ?other_form olia:hasPerson ?person }
        OPTIONAL { ?other_form olia:hasTense ?tense }
        OPTIONAL { ?other_form olia:hasVoice ?voice }

        OPTIONAL { ?other_form olia:hasCase ?case }
        OPTIONAL { ?other_form olia:hasInflectionType ?inflection }
        OPTIONAL { ?other_form olia:hasDefiniteness ?definiteness }
        OPTIONAL { ?other_form olia:hasGender ?gender }
    }

I can provide exact code and data to reproduce this if it is of any help. I have executed the same query successfully on the same dataset in Virtuoso.

Versions used:

Python 3.9.2
rdflib        6.2.0
rdflib-hdt    3.0

Querying multiple HDT graphs

This is more of a question: is it possible to query with SPARQL over more than one HDT file, or over an HDT file and a regular RDF graph? The points below apply only under the assumption that it is not (I didn't figure out how).

Is your feature request related to a problem? Please describe.
We use the WordNet HDT and several other HDT graphs with bilingual dictionaries. We want to query WordNet + dictionaries to get (possible) synsets from WordNet (HDT1) for (say) Spanish words via a Spanish-English dictionary (HDT2).

Describe the solution you'd like
Treat multiple HDT files as single graphs and access them via SPARQL GRAPH (FROM, USING, etc.).
The downside is that HDT graphs are read-only (I guess), and if HDT graphs are freely mixed with writable graphs, users may be tempted to write into them and get frustrated by the results.

Describe alternatives you've considered
Overload the SERVICE keyword, i.e.

  • If an HDT file is read, allow users to assign it an identifier (let's call that "service URI")
  • When evaluating SPARQL queries, if a SERVICE is invoked, check whether it's a pre-registered service URI; if so, return the results of SELECT * {...} from the HDT file, otherwise evaluate using the standard implementation for SERVICE
  • This could probably be done by means of a SPARQL extension

In terms of ease of implementation and user experience, this may be the preferred solution, but I feel this is a bit of a hack and it will produce SPARQL queries that can probably not be ported to other HDT implementations (unless they adopt the same strategy).

Alternatives that don't apply
In theory, we could actually use the standard implementation of SERVICE, but that would require setting up one endpoint per HDT file; it would slow things down and create considerable overhead in both coding and communication. This may not be much of an issue if we're consulting just two HDT files, but we have plans to do that on a massively multilingual scale, so we might end up with dozens or hundreds of HDT files per query.

Additional context
That seems to be a feature of the Jena integration.
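
One more possible direction, as a sketch only (untested with HDT stores; the file names are hypothetical): rdflib can aggregate several graphs into a single read-only view, which matches HDT's read-only nature:

from rdflib import Graph
from rdflib.graph import ReadOnlyGraphAggregate
from rdflib_hdt import HDTStore

# Each HDT file backs its own read-only graph (hypothetical file names)
wordnet = Graph(store=HDTStore("wordnet.hdt"))
dictionary = Graph(store=HDTStore("es-en.hdt"))

# The aggregate exposes the union of both graphs for querying
combined = ReadOnlyGraphAggregate([wordnet, dictionary])
for row in combined.query("SELECT * WHERE { ?s ?p ?o } LIMIT 10"):
    print(row)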

Define a sparqleval plugin for BGP query opt

There is a recommendation to call optimize_sparql before doing SPARQL queries, which "monkey-patches" the evalBGP method in RDFLib to use a more efficient implementation for HDT. This approach has the disadvantage that it requires an additional step from the user of the module where one should not be needed.

There is a setuptools "entry point" defined for RDFLib SPARQL eval, called rdf.plugins.sparqleval, which should allow for a similar substitution without requiring a user of HDT to add the optimize_sparql call to their program initialization. If rdflib-hdt is meant to work in environments without setuptools/pkg_resources machinery, then the optimize_sparql method can still be modified to add to the CUSTOM_EVALS dictionary rather than modify evalBGP in rdflib.plugins.sparql, which has the same effect as defining the entry point.
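
A minimal sketch of that CUSTOM_EVALS route (the hdt_eval_bgp helper is hypothetical, standing in for rdflib-hdt's optimized BGP evaluation; only the registration mechanism is shown):

import rdflib.plugins.sparql as sparql

def hdt_eval(ctx, part):
    # Handle Basic Graph Patterns with the HDT-optimized evaluator
    if part.name == 'BGP':
        return hdt_eval_bgp(ctx, part.triples)  # hypothetical optimized evaluator
    # Any other algebra node falls back to rdflib's default evaluation
    raise NotImplementedError()

# Registering here has the same effect as defining the entry point
sparql.CUSTOM_EVALS['hdt'] = hdt_eval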

Can't install via pip

I tried to install the lib on Windows 10 via pip install rdflib-hdt.
I have Python 3.7 and the new MS Build Tools with C++.
While installing, an error occurred:

hdt-cpp-1.3.3/libcds/include/libcdsBasics.h(27): fatal error C1083: Cannot open include file: 'sys/resource.h': No such file or directory
error: command 'C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.26.28801\bin\HostX86\x64\cl.exe' failed with exit status 2

Why did this error happen, and how can I solve it?
I am not a C++ expert, so I have no idea why there is no file 'sys/resource.h'.

Tests are not running: E AttributeError: can't set attribute

Hi guys, thanks for working on this. RDFLib is great!

To Reproduce
Follow the install instructions (install requirements etc.)

Run the tests: python -m pytest tests

Expected behavior
Tests should pass

Desktop (please complete the following information):

  • OS: macOS, pvenv, Python 3.9

============================= test session starts ==============================
platform darwin -- Python 3.9.9, pytest-6.2.5, py-1.11.0, pluggy-1.0.0
rootdir: /workdir/RDFLib/rdflib-hdt
collected 29 items / 2 errors / 27 selected

==================================== ERRORS ====================================
___________________ ERROR collecting tests/hdt_store_test.py ___________________
tests/hdt_store_test.py:4: in <module>
    from rdflib_hdt import HDTStore, optimize_sparql
rdflib_hdt/__init__.py:12: in <module>
    from rdflib_hdt.sparql_op import optimize_sparql
rdflib_hdt/sparql_op.py:6: in <module>
    import rdflib.plugins.sparql.evaluate as sparql_evaluate
pyhdt_venv/lib/python3.9/site-packages/rdflib/plugins/sparql/__init__.py:35: in <module>
    from . import parser
pyhdt_venv/lib/python3.9/site-packages/rdflib/plugins/sparql/parser.py:182: in <module>
    Param('prefix', PN_PREFIX)) + Suppress(':').leaveWhitespace()
pyhdt_venv/lib/python3.9/site-packages/rdflib/plugins/sparql/parserutils.py:113: in __init__
    self.name = name
E   AttributeError: can't set attribute
___________________ ERROR collecting tests/wrappers_test.py ____________________
tests/wrappers_test.py:3: in <module>
    from rdflib_hdt import HDTDocument
rdflib_hdt/__init__.py:12: in <module>
    from rdflib_hdt.sparql_op import optimize_sparql
rdflib_hdt/sparql_op.py:6: in <module>
    import rdflib.plugins.sparql.evaluate as sparql_evaluate
pyhdt_venv/lib/python3.9/site-packages/rdflib/plugins/sparql/__init__.py:35: in <module>
    from . import parser
pyhdt_venv/lib/python3.9/site-packages/rdflib/plugins/sparql/parser.py:182: in <module>
    Param('prefix', PN_PREFIX)) + Suppress(':').leaveWhitespace()
pyhdt_venv/lib/python3.9/site-packages/rdflib/plugins/sparql/parserutils.py:113: in __init__
    self.name = name
E   AttributeError: can't set attribute
=============================== warnings summary ===============================
pyhdt_venv/lib/python3.9/site-packages/rdflib/plugins/sparql/compat.py:8
pyhdt_venv/lib/python3.9/site-packages/rdflib/plugins/sparql/compat.py:8
  /workdir/RDFLib/rdflib-hdt/pyhdt_venv/lib/python3.9/site-packages/rdflib/plugins/sparql/compat.py:8: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.10 it will stop working
    from collections import Mapping, MutableMapping  # was added in 2.6

-- Docs: https://docs.pytest.org/en/stable/warnings.html
=========================== short test summary info ============================
ERROR tests/hdt_store_test.py - AttributeError: can't set attribute
ERROR tests/wrappers_test.py - AttributeError: can't set attribute
!!!!!!!!!!!!!!!!!!! Interrupted: 2 errors during collection !!!!!!!!!!!!!!!!!!!!

HDTStore.triples incorrectly has 'context' as positional argument

Describe the bug
When looping over the triples in an HDTStore object using the triples(pattern, context) method, the context argument cannot be left out or given as None. The description in README.md indicates that the context should be optional. Moreover, requiring only the pattern is the expected behaviour of regular RDFLib graph stores.

To Reproduce

RDFLib Version: 5.0.0

from rdflib import Graph
from rdflib_hdt import HDTStore

g = HDTStore("data.hdt")
for s, p, o in g.triples((None, None, None)):
    continue

TypeError: triples() missing 1 required positional argument: 'context'

Expected behavior
The above code should run without error, i.e., 'context' should be optional. Note that the following code, which is executed under the hood, works fine:

for s, p, o in g.hdt_document.search((None, None, None))[0]:
    continue

Filter triples

Hi,

How can I get triples whose predicates start with a given string? For example:

triples_sub, cardinality_sub = document.search_triples(entity, "http://www.wikidata.org/prop/direct/{}", "")

to get triples whose subject is entity and whose predicate starts with "http://www.wikidata.org/prop/direct/".
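
A possible workaround, as an untested sketch (HDT pattern search matches exact terms only, so prefix filtering has to happen on the client side):

# Leave the predicate unbound ("" is the wildcard) and filter in Python
prefix = "http://www.wikidata.org/prop/direct/"
triples, cardinality = document.search_triples(entity, "", "")
filtered = [(s, p, o) for s, p, o in triples if p.startswith(prefix)]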

Add inference support (e.g., RDFS)

It would be useful to implement inference support that leverages the advantages of HDT indexes.

The rdflib package supports custom evaluation functions as documented at https://rdflib.readthedocs.io/en/stable/intro_to_sparql.html#custom-evaluation-functions and https://rdflib.readthedocs.io/en/stable/_modules/examples/custom_eval.html#customEval. However, a custom implementation of standard RDFS can be slow and miss opportunities for optimizations that HDT could provide.

Note that Issue #14 and the existing optimize_sparql() may be related.
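
One existing route, sketched with the OWL-RL package from the RDFLib family (untested against HDT; since HDT-backed stores are read-only, inferred triples would have to be materialized into a separate writable graph):

from rdflib import Graph
from owlrl import DeductiveClosure, RDFS_Semantics

# Copy the triples into a writable in-memory graph first (HDT is read-only)
g = Graph()
g.parse("data.ttl")  # hypothetical input data
DeductiveClosure(RDFS_Semantics).expand(g)  # adds all RDFS-entailed triples in place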
