Giter Club home page Giter Club logo

rdfhash's Introduction

rdfhash: RDF Graph Hashing/Compression Tool

rdfhash is a utility for RDF graph compression that works by hashing RDF subjects based on a checksum of their triples, effectively minimizing the size of RDF graphs by consolidating subjects that have identical definitions.

Installation

You can install rdfhash using pip, a package manager for Python. Ensure python and pip are properly installed on your system, then run the following command:

pip install rdfhash

Usage

Command Line Interface (CLI)

Basic Usage

By default, all blank nodes in a text/turtle file or string are replaced by their hashed definition:

rdfhash '
@prefix hash: <http://rdfhash.com/ontology/> .

[ ] a hash:Attribute ;
    hash:unit hash:unit:Centimeters ;
    hash:value 5.38 .'

Output:

@prefix hash: <http://rdfhash.com/ontology/> .

<sha256:960891b4b1856b4d2c24b977f75d497e4da9e6f147a292524ae51db5fd0e864e> 
    a hash:Attribute ;
    hash:unit <http://rdfhash.com/ontology/unit:Centimeters> ;
    hash:value 5.38 .

Advanced Usage

The rdfhash tool is highly customizable and can be tailored to fit the requirements of any organization:

rdfhash '
@prefix hash: <http://rdfhash.com/ontology/> .
@prefix md5: <http://rdfhash.com/instances/md5/> .

[ ] a hash:Contact ;
    hash:phone "487-538-2824" ;
    hash:email "[email protected]" ;
    hash:name [ 
        a hash:LegalName ;
        hash:firstName "John" ;
        hash:lastName "Smith" ;
    ] ;
    hash:address [ 
        a hash:Address ;
        hash:street "4567 Mountain Peak Way" ;
        hash:city "Denver" ;
        hash:state "CO" ;
        hash:zip "80202" ;
        hash:country "USA" ;
    ] ;
.' \
--method md5 \
--template 'http://rdfhash.com/instances/{method}/{value}' \
--sparql '
prefix hash: <http://rdfhash.com/ontology/>
select ?s where { 
    ?s a ?type . 
    VALUES ?type {
        hash:Contact
        hash:LegalName
        hash:Address
    }
}'
  • --method specifies the hashing algorithm to use. The default is sha256.
  • --template specifies the URI template to use for hashed subjects. The default is {method}:{value}.
  • --sparql specifies the SPARQL query to use for selecting subjects to hash. The default is SELECT ?s WHERE { ?s ?p ?o . FILTER(isBlank(?s))} (Selecting all Blank Node subjects).
  • Run rdfhash --help for more information on available parameters.

Output:

@prefix hash: <http://rdfhash.com/ontology/> .
@prefix md5: <http://rdfhash.com/instances/md5/> .

md5:8fc18e400ff531e5cbe02fef751662ba 
    a hash:Contact ;
    hash:phone "487-538-2824" ;
    hash:email "[email protected]" ;
    hash:name md5:5fd42f2c072c80e3db760c3fc69b91b8 ;
    hash:address md5:9a3e3ce644e2c5271015d9665675a8e5 .

md5:5fd42f2c072c80e3db760c3fc69b91b8 
    a hash:LegalName ;
    hash:firstName "John" ;
    hash:lastName "Smith" .

md5:9a3e3ce644e2c5271015d9665675a8e5 
    a hash:Address ;
    hash:street "4567 Mountain Peak Way" ;
    hash:city "Denver" ;
    hash:state "CO" ;
    hash:zip "80202" ;
    hash:country "USA" .

Import as a Python Module

from rdfhash import hash_subjects

data = '''
@prefix hash: <http://rdfhash.com/ontology/> .
@prefix sha1: <http://rdfhash.com/instances/sha1/> .

<http://rdfhash.com/instances/Meaning-of-Life>
    a hash:Attribute ;
    hash:value 42 .
'''

graph, subjects_replaced = hash_subjects(
    data,
    method='sha1',
    template='http://rdfhash.com/instances/{method}/{value}',
    sparql_select_subjects='''
    prefix hash: <http://rdfhash.com/ontology/>
    SELECT ?s WHERE { ?s a hash:Attribute. }
    '''
)

print(graph.serialize(format='turtle'))

Output:

@prefix hash: <http://rdfhash.com/ontology/> .
@prefix sha1: <http://rdfhash.com/instances/sha1/> .

sha1:4afe716d630b17d5a5d06f0901800e16f3e8c9a4
    a hash:Attribute ;
    hash:value 42 .

Limitations

It's important to note where rdfhash is limited in its functionality. These limitations are expected to be addressed in future versions.

  • The rdfhash tool does not yet fully support Named Graphs (e.g. text/trig or application/n-quads)
    • Users can still attempt to pass RDF data containing Named Graphs, although the expected output has not yet been tested.
  • Circular dependencies between selected subjects are currently not allowed. (e.g. Inverse properties). A Directed Acyclic Graph (DAG) is required at the moment.
    • Best practice to follow is prioritizing broader-to-narrower relationships. (e.g. A person Contact points to LegalName and Address and not inversely. Multiple contacts can point to the same LegalName or Address.)
    • Future rdfhash versions will support ignoring specific properties used in a subject's hash, allowing the use of inverse properties.
  • Currently, selected subjects are expected to be fully defined in the input graph.
    • Future rdfhash versions will support connections to a SPARQL endpoint to fetch full context for hashing.

rdfhash's People

Contributors

neilgraham avatar

Stargazers

 avatar Jonny Saunders avatar  avatar Ben Damman avatar Jasper avatar  avatar Mark Wallace avatar Matteo Bianchetti avatar  avatar

Watchers

James Cloos avatar  avatar

rdfhash's Issues

Add feature for skolemization

This hashes the data and condenses, but as Rebecca brought up, a simple thing that could expand functionality is skolemization.

New Feature: `sparql` argument, SELECT multiple variables

New Feature: sparql argument, SELECT multiple variables

Branch: feature/sparql-multiple-variables

Allow rdfhash command to hash multiple subjects by SELECTing multiple variables via the sparql argument.

$ rdfhash --sparql '
SELECT ?person ?parent ?child
WHERE {
    ?person
        a <Person>  ;
        <dateOfBirth> "1999-5-20"^^xsd:date ;
        <hasChild> ?child  ;
        <hasParent> ?parent  ;
        .
}
'
  • At the moment, only values in ?person would have their subjects hashed.
  • With the feature implemented, ?parent and ?child values would also have their subjects hashed.

New Feature: Add `uuid` and `ulid` methods

New Feature: Add uuid and ulid methods

Allow for the ability to replace selected subjects with Unique Identifiers such as uuid and ulid.

<ulid:01GJCBQBSH99KGER8TR44BP51R>
    a <Person> ;
    <hasID> "kreeves" ;
    <hasOccupation> <Actor> ;
    .

<ulid:01GJCBQBSH99KGER8TR44BP51R> 
    a <Person> ;
    <hasID> "mdamon" ;
    <hasOccupation> <Actor> ;
    .

Task

  • Add the following method argument options:
    • uuid1, uuid3, uuid4, uuid5, ulid
  • When using UUID methods, do not throw circular dependency errors.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.