json2rdf's Introduction

JSON2RDF

Streaming generic JSON to RDF converter

Reads JSON data and streams N-Triples output. The conversion algorithm is similar to that of JSON-LD but accepts arbitrary JSON and does not require a @context.

The resulting RDF representation is lossless with the exception of array ordering and some datatype round-tripping. The lost ordering should not be a problem in the majority of cases, as RDF applications tend to impose their own value-based ordering using SPARQL ORDER BY.

A common use case is feeding the JSON2RDF output into a triplestore or SPARQL processor and using a SPARQL CONSTRUCT query to map the generic RDF to more specific RDF that uses terms from some vocabulary. SPARQL is an inherently more flexible RDF mapping mechanism than JSON-LD @context.

Build

mvn clean install

This should produce an executable JAR file, target/json2rdf-jar-with-dependencies.jar, that bundles the dependency libraries.

Maven

Each version is released to the Maven Central repository as com.atomgraph.etl.json/json2rdf.

Usage

JSON data is read from stdin and the resulting RDF data is written to stdout.

JSON2RDF is available as a .jar as well as a Docker image atomgraph/json2rdf (recommended).

Parameters:

  • base - the base URI for the data. Property namespace is constructed by adding # to the base URI.
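
As a rough illustration of the base parameter, the sketch below builds a property URI by appending # and the JSON key to the base URI. The class and method names are hypothetical and not part of JSON2RDF; the converter itself resolves URIs through Apache Jena, so this only approximates the result seen in the examples.

```java
// Hypothetical helper, not part of JSON2RDF: approximates how a JSON key
// ends up as a property URI in the "base URI + #" namespace.
final class PropertyNamespace {
    private PropertyNamespace() {}

    // Returns the property namespace for a base URI, e.g. "https://localhost/#".
    static String namespaceOf(String baseUri) {
        return baseUri + "#";
    }

    // Returns the property URI for a JSON key within that namespace.
    static String propertyUri(String baseUri, String jsonKey) {
        return namespaceOf(baseUri) + jsonKey;
    }

    public static void main(String[] args) {
        // Prints https://localhost/#homepage
        System.out.println(propertyUri("https://localhost/", "homepage"));
    }
}
```

With the base URI https://localhost/, the key "homepage" maps to https://localhost/#homepage, which is the form the property URIs take in the examples.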

Options:

  • --input-charset - JSON input encoding, by default UTF-8
  • --output-charset - RDF output encoding, by default UTF-8

Examples

JSON2RDF streams its output as N-Triples, so we pipe it through riot to get more readable Turtle output.


Bob DuCharme's blog post on using JSON2RDF: Converting JSON to RDF.


JSON data in ordinary-json-document.json

{
  "name": "Markus Lanthaler",
  "homepage": "http://www.markus-lanthaler.com/",
  "image": "http://twitter.com/account/profile_image/markuslanthaler"
}

Java execution from shell:

cat ordinary-json-document.json | java -jar json2rdf-jar-with-dependencies.jar https://localhost/ | riot --formatted=TURTLE

Alternatively, Docker execution from shell:

cat ordinary-json-document.json | docker run --rm -i -a stdin -a stdout -a stderr atomgraph/json2rdf https://localhost/ | riot --formatted=TURTLE

Note that when using Docker, you need to bind the stdin/stdout/stderr streams.

Turtle output

[ <https://localhost/#homepage>  "http://www.markus-lanthaler.com/" ;
  <https://localhost/#image>     "http://twitter.com/account/profile_image/markuslanthaler" ;
  <https://localhost/#name>      "Markus Lanthaler"
] .

The following SPARQL query can be used to map this generic RDF to the desired target RDF, e.g. a structure that uses schema.org vocabulary.

BASE <https://localhost/>
PREFIX : <#>
PREFIX schema: <http://schema.org/>

CONSTRUCT
{
  ?person schema:homepage ?homepage ;
    schema:image ?image ;
    schema:name ?name .
}
{
  ?person :homepage ?homepageStr ;
    :image ?imageStr ;
    :name ?name .
  BIND (URI(?homepageStr) AS ?homepage)
  BIND (URI(?imageStr) AS ?image)
}

Turtle output after the mapping

[ <http://schema.org/homepage>  <http://www.markus-lanthaler.com/> ;
  <http://schema.org/image>     <http://twitter.com/account/profile_image/markuslanthaler> ;
  <http://schema.org/name>      "Markus Lanthaler"
] .

JSON data in city-distances.json

{
  "desc"    : "Distances between several cities, in kilometers.",
  "updated" : "2014-02-04T18:50:45",
  "uptodate": true,
  "author"  : null,
  "cities"  : {
    "Brussels": [
      {"to": "London",    "distance": 322},
      {"to": "Paris",     "distance": 265},
      {"to": "Amsterdam", "distance": 173}
    ],
    "London": [
      {"to": "Brussels",  "distance": 322},
      {"to": "Paris",     "distance": 344},
      {"to": "Amsterdam", "distance": 358}
    ],
    "Paris": [
      {"to": "Brussels",  "distance": 265},
      {"to": "London",    "distance": 344},
      {"to": "Amsterdam", "distance": 431}
    ],
    "Amsterdam": [
      {"to": "Brussels",  "distance": 173},
      {"to": "London",    "distance": 358},
      {"to": "Paris",     "distance": 431}
    ]
  }
}

Java execution from shell:

cat city-distances.json | java -jar json2rdf-jar-with-dependencies.jar https://localhost/ | riot --formatted=TURTLE

Alternatively, Docker execution from shell:

cat city-distances.json | docker run --rm -i -a stdin -a stdout -a stderr atomgraph/json2rdf https://localhost/ | riot --formatted=TURTLE

Turtle output

[ <https://localhost/#cities>    [ <https://localhost/#Amsterdam>  [ <https://localhost/#distance>  "431"^^<http://www.w3.org/2001/XMLSchema#int> ;
                                                                     <https://localhost/#to>        "Paris"
                                                                   ] ;
                                   <https://localhost/#Amsterdam>  [ <https://localhost/#distance>  "358"^^<http://www.w3.org/2001/XMLSchema#int> ;
                                                                     <https://localhost/#to>        "London"
                                                                   ] ;
                                   <https://localhost/#Amsterdam>  [ <https://localhost/#distance>  "173"^^<http://www.w3.org/2001/XMLSchema#int> ;
                                                                     <https://localhost/#to>        "Brussels"
                                                                   ] ;
                                   <https://localhost/#Brussels>   [ <https://localhost/#distance>  "322"^^<http://www.w3.org/2001/XMLSchema#int> ;
                                                                     <https://localhost/#to>        "London"
                                                                   ] ;
                                   <https://localhost/#Brussels>   [ <https://localhost/#distance>  "265"^^<http://www.w3.org/2001/XMLSchema#int> ;
                                                                     <https://localhost/#to>        "Paris"
                                                                   ] ;
                                   <https://localhost/#Brussels>   [ <https://localhost/#distance>  "173"^^<http://www.w3.org/2001/XMLSchema#int> ;
                                                                     <https://localhost/#to>        "Amsterdam"
                                                                   ] ;
                                   <https://localhost/#London>     [ <https://localhost/#distance>  "358"^^<http://www.w3.org/2001/XMLSchema#int> ;
                                                                     <https://localhost/#to>        "Amsterdam"
                                                                   ] ;
                                   <https://localhost/#London>     [ <https://localhost/#distance>  "322"^^<http://www.w3.org/2001/XMLSchema#int> ;
                                                                     <https://localhost/#to>        "Brussels"
                                                                   ] ;
                                   <https://localhost/#London>     [ <https://localhost/#distance>  "344"^^<http://www.w3.org/2001/XMLSchema#int> ;
                                                                     <https://localhost/#to>        "Paris"
                                                                   ] ;
                                   <https://localhost/#Paris>      [ <https://localhost/#distance>  "431"^^<http://www.w3.org/2001/XMLSchema#int> ;
                                                                     <https://localhost/#to>        "Amsterdam"
                                                                   ] ;
                                   <https://localhost/#Paris>      [ <https://localhost/#distance>  "344"^^<http://www.w3.org/2001/XMLSchema#int> ;
                                                                     <https://localhost/#to>        "London"
                                                                   ] ;
                                   <https://localhost/#Paris>      [ <https://localhost/#distance>  "265"^^<http://www.w3.org/2001/XMLSchema#int> ;
                                                                     <https://localhost/#to>        "Brussels"
                                                                   ]
                                 ] ;
  <https://localhost/#desc>      "Distances between several cities, in kilometers." ;
  <https://localhost/#updated>   "2014-02-04T18:50:45" ;
  <https://localhost/#uptodate>  true
] .

Mapping Twitter export to RDF

You can download your Twitter data which includes tweets in tweets.js. Remove the window.YTD.tweets.part0 = string and save the rest as tweets.json.
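
If you prefer to script that clean-up step, a minimal sketch could look like the following. The class name and the file locations in main are assumptions; the prefix is the one named above and may differ in other exports.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: turns the tweets.js export into plain JSON.
final class TweetsJsToJson {
    // Prefix named in the text above; adjust it if your export differs.
    static final String PREFIX = "window.YTD.tweets.part0 =";

    private TweetsJsToJson() {}

    // Removes the JavaScript assignment prefix, if present, and trims whitespace.
    static String stripPrefix(String js) {
        String trimmed = js.strip();
        return trimmed.startsWith(PREFIX)
            ? trimmed.substring(PREFIX.length()).strip()
            : trimmed;
    }

    public static void main(String[] args) throws IOException {
        // Assumes tweets.js sits in the current working directory.
        String js = Files.readString(Path.of("tweets.js"), StandardCharsets.UTF_8);
        Files.writeString(Path.of("tweets.json"), stripPrefix(js), StandardCharsets.UTF_8);
    }
}
```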

To get the RDF output, save the following query as tweets.rq:

BASE            <https://twitter.com/>
PREFIX :        <#>
PREFIX xsd:     <http://www.w3.org/2001/XMLSchema#>
PREFIX sioc:    <http://rdfs.org/sioc/ns#>
PREFIX dct:     <http://purl.org/dc/terms/>

CONSTRUCT
{
    ?tweet a sioc:Post ;
        sioc:id ?id ;
        dct:created ?created ;
        sioc:content ?content ;
        sioc:reply_of ?reply_of .
}
{
    ?tweet_obj :id ?id ;
        :created_at ?created_at_string ;
        :full_text ?content .
    OPTIONAL
    {
        ?tweet_obj :in_reply_to_status_id ?in_reply_to_status_id ;
            :in_reply_to_screen_name ?in_reply_to_screen_name .
        BIND(URI(CONCAT(?in_reply_to_screen_name, "/status/", ?in_reply_to_status_id)) AS ?reply_of)
    }

    BIND("atomgraphhq" AS ?username)
    BIND(URI(CONCAT(?username, "/status/", ?id)) AS ?tweet)
    BIND(SUBSTR(?created_at_string, 27, 4) AS ?year_string)
    BIND(SUBSTR(?created_at_string, 5, 3) AS ?month_string)
    BIND(SUBSTR(?created_at_string, 9, 2) AS ?day_string)
    VALUES (?month_string ?month_number_string)
    {
         ("Jan"    "01")
         ("Feb"    "02")
         ("Mar"    "03")
         ("Apr"    "04")
         ("May"    "05")
         ("Jun"    "06")
         ("Jul"    "07")
         ("Aug"    "08")
         ("Sep"    "09")
         ("Oct"    "10")
         ("Nov"    "11")
         ("Dec"    "12")
    }
    BIND(SUBSTR(?created_at_string, 12, 8) AS ?time)
    BIND(SUBSTR(?created_at_string, 21, 3) AS ?tz_hours)
    BIND(SUBSTR(?created_at_string, 24, 2) AS ?tz_minutes)
    BIND(STRDT(CONCAT(?year_string, "-", ?month_number_string, "-", ?day_string, "T", ?time, ?tz_hours, ":", ?tz_minutes), xsd:dateTime) AS ?created)
}

Set ?username in the query to your Twitter handle, then run this command:

cat tweets.json | docker run --rm -i -a stdin -a stdout -a stderr atomgraph/json2rdf https://twitter.com/ > tweets.nt && \
    sparql --data tweets.nt --query tweets.rq > tweets.ttl

Output sample:

<https://twitter.com/atomgraphhq/status/1535239790693699587>
        a              sioc:Post ;
        dct:created    "2022-06-10T12:37:44+00:00"^^xsd:dateTime ;
        sioc:content   "Follow it on GitHub!\nhttps://t.co/pu5KkOoIOX" ;
        sioc:id        "1535239790693699587" ;
        sioc:reply_of  <https://twitter.com/atomgraphhq/status/1535211486582382593> .

Improvements to the mapping query are welcome.
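
To sanity-check the date handling in the mapping query outside SPARQL, the same conversion can be sketched with java.time. The input layout (for example "Fri Jun 10 12:37:44 +0000 2022") is an assumption inferred from the SUBSTR offsets in the query, and the class name is hypothetical.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper mirroring the created_at conversion done by the query.
final class TweetDates {
    // Assumed input layout, inferred from the SUBSTR offsets in the query.
    private static final DateTimeFormatter CREATED_AT =
        DateTimeFormatter.ofPattern("EEE MMM dd HH:mm:ss Z yyyy", Locale.ENGLISH);

    // xsd:dateTime lexical form with an explicit +HH:MM offset.
    private static final DateTimeFormatter XSD_DATE_TIME =
        DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ssxxx", Locale.ENGLISH);

    private TweetDates() {}

    // Parses a created_at string and formats it as an xsd:dateTime lexical value.
    static String toXsdDateTime(String createdAt) {
        return OffsetDateTime.parse(createdAt, CREATED_AT).format(XSD_DATE_TIME);
    }

    public static void main(String[] args) {
        // Prints 2022-06-10T12:37:44+00:00
        System.out.println(toXsdDateTime("Fri Jun 10 12:37:44 +0000 2022"));
    }
}
```

Unlike the SUBSTR-based query, the parser also rejects strings whose weekday does not match the date, which can help spot malformed input.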

Performance

Largest dataset tested so far: 2.95 GB / 30459482 lines of JSON to 4.5 GB / 21964039 triples in 2m10s. Hardware: x64 Windows 10 PC with Intel Core i5-7200U 2.5 GHz CPU and 16 GB RAM.

Dependencies

json2rdf's People

Contributors

jetztgradnet, namedgraph, tmciver

json2rdf's Issues

whitespace in json keys

characters.json:

{
  "characters": [
    {
      "first name": "Ash",
      "lastname": "Ketchum"
    }
  ]
}


This is valid JSON, but:

cat characters.json | docker run --rm -i atomgraph/json2rdf 'http://tmp.com/'
Exception in thread "main" org.apache.jena.riot.RiotException: <http://tmp.com/#first name> Code: 17/WHITESPACE in FRAGMENT: A single whitespace character. These match no grammar rules of URIs/IRIs. These characters are permitted in RDF URI References, XML system identifiers, and XML Schema anyURIs.
        at org.apache.jena.riot.system.IRIResolver.exceptions(IRIResolver.java:384)
        at org.apache.jena.riot.system.IRIResolver.resolve(IRIResolver.java:341)
        at org.apache.jena.riot.system.IRIResolver.resolveToString(IRIResolver.java:358)
        at com.atomgraph.etl.json.JsonStreamRDFWriter.write(JsonStreamRDFWriter.java:107)
        at com.atomgraph.etl.json.JsonStreamRDFWriter.convert(JsonStreamRDFWriter.java:66)
        at com.atomgraph.etl.json.JSON2RDF.convert(JSON2RDF.java:82)
        at com.atomgraph.etl.json.JSON2RDF.main(JSON2RDF.java:60)

Should json2rdf simply delete characters that are not URI-safe, or perhaps percent-encode them?
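
For illustration only, the percent-encoding option could be sketched as below. This is not how JSON2RDF behaves today; the class and method names are hypothetical, and java.net.URLEncoder is used as a convenient approximation rather than a strict IRI fragment encoder.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the percent-encoding option raised in this issue.
final class KeyEncoder {
    private KeyEncoder() {}

    // Percent-encodes a JSON key so it can be used as a URI fragment.
    static String encodeKey(String jsonKey) {
        // URLEncoder implements form encoding and emits '+' for a space.
        // A literal '+' in the key has already become %2B at this point,
        // so replacing '+' with %20 only affects encoded spaces.
        return URLEncoder.encode(jsonKey, StandardCharsets.UTF_8).replace("+", "%20");
    }

    // Builds the property URI as base URI + "#" + encoded key.
    static String propertyUri(String baseUri, String jsonKey) {
        return baseUri + "#" + encodeKey(jsonKey);
    }

    public static void main(String[] args) {
        // Prints http://tmp.com/#first%20name
        System.out.println(propertyUri("http://tmp.com/", "first name"));
    }
}
```

Encoding rather than deleting characters keeps distinct keys distinct, at the cost of less readable property URIs.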

Incorrect version number used in Docker container

When running JSON2RDF in the Docker container as shown in the README, I get this error message:

xxxx [~/dev/workspaces/graphstuff/JSON2RDF]> echo "{'label':'Hello'}" | head -100 | docker run -i -a stdin -a stdout -a stderr atomgraph/json2rdf urn:test:
Error: Unable to access jarfile target/json2rdf-1.0.0-SNAPSHOT-jar-with-dependencies.jar

The version number in the Dockerfile wasn't adjusted when bumping the version to 1.0.1.

Adding an Apache Spark UDF?

Hi! Would it make sense to have a small addition that makes the library usable in Apache Spark? Something along the lines of

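package com.atomgraph.etl.json;

import java.io.ByteArrayOutputStream;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.system.StreamRDF;
import org.apache.jena.riot.system.StreamRDFWriter;
import org.apache.spark.sql.api.java.UDF1;

// Sketch of a Spark UDF that converts one JSON string to N-Triples text
public class Json2rdfUDF implements UDF1<String, String> {
    private static final long serialVersionUID = 1L;
    private final String baseURI; // base URI passed on to the converter

    public Json2rdfUDF(String baseURI) {
        this.baseURI = baseURI;
    }

    @Override
    public String call(String jsonString) throws Exception {
        Reader reader = new StringReader(jsonString);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Stream the triples as N-Triples into an in-memory buffer
        StreamRDF rdfStream = StreamRDFWriter.getWriterStream(out, Lang.NTRIPLES);
        // Assumes convert() finishes (and thereby flushes) the stream
        new JsonStreamRDFWriter(reader, rdfStream, baseURI).convert();
        return out.toString(StandardCharsets.UTF_8.name());
    }
}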

# for namespace properties should be optional

From the README: "Property namespace is constructed by adding # to the base URI."

If the base URI ends in a forward slash (/), a pound/number sign (#) should not be added to the base URI.
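
The requested behaviour could be sketched as follows. This is a hypothetical illustration of the proposal, not current JSON2RDF behaviour; also skipping the # when the base URI already ends in # is an extra assumption that the issue does not mention.

```java
// Hypothetical sketch of the proposal: add '#' only when it is needed.
final class OptionalHashNamespace {
    private OptionalHashNamespace() {}

    // Returns the property namespace for the given base URI.
    static String namespaceOf(String baseUri) {
        // The '/' case comes from the issue; the '#' case is an added assumption.
        boolean alreadyTerminated = baseUri.endsWith("/") || baseUri.endsWith("#");
        return alreadyTerminated ? baseUri : baseUri + "#";
    }

    public static void main(String[] args) {
        System.out.println(namespaceOf("https://localhost/"));     // https://localhost/
        System.out.println(namespaceOf("https://localhost/data")); // https://localhost/data#
    }
}
```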
