dutch-case-law-to-couchdb's Introduction

Dutch case law as linked open data

This repository contains code that clones Dutch case law XML documents from data.rechtspraak.nl and converts the metadata to JSON, which makes the documents easier to process by machine. Because the documents are stored in a CouchDB database, the data is immediately amenable to MapReduce jobs.

Everything described here is a derivative of the rechtspraak.nl web service. Original documentation is available in Dutch here.

Document API

Use this API to get Dutch case law documents. Documents are available in XML and HTML.

XML documents are retrieved using the URL scheme https://rechtspraak.cloudant.com/docs/{ECLI identifier}/data.xml. Example.

HTML snippets are retrieved using the URL scheme https://rechtspraak.cloudant.com/docs/{ECLI identifier}/data.htm. Example.
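
The two URL schemes above can be assembled programmatically. The following is a minimal sketch in Python; the function name `document_url` and the format keys `"xml"` and `"html"` are illustrative choices, not part of the API.

```python
from urllib.parse import quote

# Base URL taken from the URL schemes described above.
BASE_URL = "https://rechtspraak.cloudant.com/docs"

# Hypothetical format keys mapped to the file names used by the URL schemes.
FILE_NAMES = {"xml": "data.xml", "html": "data.htm"}


def document_url(ecli: str, fmt: str = "xml") -> str:
    """Build the URL of the XML document or HTML snippet for an ECLI identifier."""
    if fmt not in FILE_NAMES:
        raise ValueError(f"unsupported format: {fmt!r}")
    # ECLI identifiers contain colons, which are legal in a URL path segment,
    # so they are left unescaped; anything else unusual is percent-encoded.
    return f"{BASE_URL}/{quote(ecli, safe=':')}/{FILE_NAMES[fmt]}"
```

For a made-up identifier such as `ECLI:NL:XX:2000:1`, `document_url("ECLI:NL:XX:2000:1", "html")` yields the corresponding `data.htm` URL.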

Metadata

Document metadata is retrieved using the URL scheme https://rechtspraak.cloudant.com/docs/{ECLI identifier}. Example.
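
As a rough illustration, the metadata for one document could be fetched and decoded as follows. This is a sketch that assumes the service is reachable and returns a JSON body; `metadata_url`, `fetch_metadata` and the `opener` parameter are hypothetical names introduced here, the last one only so the function can be exercised without network access.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Base URL taken from the URL scheme described above.
BASE_URL = "https://rechtspraak.cloudant.com/docs"


def metadata_url(ecli):
    """Build the metadata URL for an ECLI identifier."""
    return f"{BASE_URL}/{quote(ecli, safe=':')}"


def fetch_metadata(ecli, opener=urlopen):
    """Download the metadata document and decode it from JSON.

    `opener` defaults to urllib's urlopen; pass a stand-in to test offline.
    """
    with opener(metadata_url(ecli)) as response:
        return json.loads(response.read().decode("utf-8"))
```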

Rechtspraak.nl provides metadata for judgments in XML/RDF, but the RDF is not well formed. XML is also arguably a poor format for metadata. In any case, CouchDB describes documents in JSON, so the metadata is converted to JSON. To remain RDF-compatible, we adhere to the JSON-LD format.

In addition, some extra metadata fields are generated, and a number of MapReduce jobs are defined on the data.

Metadata format

The cloning process attempts to correct metadata issues. Document metadata is described in JSON, and RDF triples are preserved through JSON-LD.
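
To make the JSON-LD idea concrete, here is a small sketch of how a metadata document with an `@context` might look, and how a prefixed field name resolves to a full IRI. The field names are ones documented in this README; the values, the `psi` namespace URI and the helper `expand_term` are assumptions made for illustration. This is not a full JSON-LD processor.

```python
# Hypothetical metadata document; values are made up for illustration.
metadata = {
    "@context": {
        "dcterms": "http://purl.org/dc/terms/",
        "psi": "http://psi.rechtspraak.nl/",  # assumed namespace URI
        "Procedure": "psi:procedure",         # alias for a prefixed name
    },
    "dcterms:language": "nl",
    "dcterms:accessRights": "public",
    "dcterms:date": "2000-01-01",
    "psi:zaaknummer": ["00/0000"],
}


def expand_term(term, context):
    """Resolve an alias or prefixed name to a full IRI using a flat @context."""
    # An alias such as "Procedure" is first replaced by its compact IRI.
    term = context.get(term, term)
    prefix, sep, local = term.partition(":")
    if sep and prefix in context:
        return context[prefix] + local
    # Keywords such as "@type" and unknown terms are returned unchanged.
    return term
```

With this context, both `Procedure` and `psi:procedure` expand to the same IRI, which is what keeps the JSON readable while remaining interpretable as RDF.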

The following table presents all metadata fields and makes some guarantees about their JSON structure. We make some assumptions about the RDF triples that Rechtspraak.nl provides that are not strictly necessary, but that make the data easier to work with. Some values also merit extra processing in order to keep our RDF consistent.

Some fields are uncommon, so for each field we provide a link to an example document which uses that field. Visit ecli/_design/query_dev/_view/docs_with_field?group_level=2&startkey=[<field name>& endkey=[<field name>,] to see which documents contain that particular field.
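
A query URL for this view can be built along these lines. The sketch follows the pattern of the example links in the table below (`stale=ok`, `limit=100`, and an `endkey` that appends the high code point `\ufff0` to the field name); the function name and default values are assumptions.

```python
import json
from urllib.parse import urlencode

# View path as given above, on the same host as the document API.
VIEW_URL = (
    "https://rechtspraak.cloudant.com/"
    "ecli/_design/query_dev/_view/docs_with_field"
)


def docs_with_field_url(field, group_level=2, limit=100):
    """Build a view query listing documents that contain `field`."""
    params = {
        "stale": "ok",
        "limit": limit,
        "group_level": group_level,
        # CouchDB view keys are JSON values; here they are one-element arrays.
        "startkey": json.dumps([field]),
        # json.dumps escapes U+FFF0 as the ASCII sequence \ufff0.
        "endkey": json.dumps([field + "\ufff0"]),
    }
    return f"{VIEW_URL}?{urlencode(params)}"
```

Calling `docs_with_field_url("dcterms:subject")` gives a URL whose key range covers every key that starts with that field name.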

Global metadata

These fields may appear either in the block for document metadata or the block for register metadata, and we assume they are the same in both.

| Tag name / JSON field | JSON value | Description |
|---|---|---|
| dcterms:accessRights | String (relative URI) | Fixed to 'public'. Some manifestations may be non-public, such as those with unredacted names, but we don't have access to those. |
| dcterms:publisher | Object (resource) | Court. Assumed to be a single object. |
| dcterms:title | String (literal) | Document title. Most often, this is a concatenation of the ECLI number with the court name and date. |
| dcterms:language | String (resource URI) | Fixed to 'nl'. |
| dcterms:abstract | String (literal) | Short summary. We do not include abstracts that consist of a single dash, because they are uninformative. |
| dcterms:replaces | String (literal) | LJN number which this ECLI replaces. |
| dcterms:isReplacedBy | String (literal) | If the current ECLI is not valid, this points to a replacement ECLI. Note that this concerns only the identifier. Doesn't seem to be used in practice. |
| dcterms:contributor | Array of objects | Supposedly denotes the judge. We would like to extend this to include other entities such as lawyers. We may reify these links to denote the roles these people have in the case. <a href="https://rechtspraak.cloudant.com/ecli/_design/query_dev/_view/docs_with_field?stale=ok&limit=100&group_level=2&startkey=[%22dcterms:contributor%22]&endkey=[%22dcterms:contributor\ufff0%22]">Doesn't seem to be used in practice.</a> |
| dcterms:date | String (literal) | Date of judgment. |
| dcterms:alternative | Array of strings (literal) | Aliases / alternative titles. <a href="https://rechtspraak.cloudant.com/ecli/_design/query_dev/_view/docs_with_field?stale=ok&limit=100&group_level=1&startkey=[%22dcterms:alternative%22]&endkey=[%22dcterms:alternative\ufff0%22]">Doesn't seem to be used in practice.</a> |
| psi:procedure | List of objects (resources) | What kind of procedure this case is (e.g., 'appeal'). Rechtspraak.nl XML assigns the label 'Procedure' to this tag using a <code>rdfs:label</code> predicate. To fully represent this in RDF, we should reify this triple. But to keep our document readable, we assign a JSON-LD alias from <code>Procedure</code> to <code>psi:procedure</code> in <code>@context</code>. |
| dcterms:creator | Object (resource). Note that we assume a cardinality of 1: behaviour is not defined for multiple <code>dcterms:creator</code> tags. | Court in which this judgment was made. <strong>NOTE:</strong> psi:afdeling is deprecated, so we won't parse it. |
| dcterms:type | Object (resource) | Represents either 'Uitspraak' or 'Conclusie' ('judgment' or 'conclusion'). |
| dcterms:temporal | Object (resource) | Indicates a timespan within which the case must be judged, which may occur, for example, in tax law. |
| dcterms:references | Array of objects (resources) | These triples carry additional data: what <em>kind</em> of reference is this? This should be reified on the triple, but we just add a <code>referenceType</code> field to the referent object. <strong>NOTE:</strong> It was discussed whether this should reference an *expression* of a law, because it refers to the law at a particular time (usually the time of the court case). I don't resolve the expression because we can't know with full certainty to what time it refers. It's rechtspraak.nl's responsibility to get the reference right anyway. |
| dcterms:coverage | Array of objects (resources) | The jurisdiction to which this judgment is relevant. |
| dcterms:hasVersion | Array of objects (resources) | Where versions of this judgment can be found. These might be different expressions (e.g., edited and annotated). |
| dcterms:relation | Array of objects (reified statements) | Relations to other cases. The <cite><a href="http://dublincore.org/documents/dcmi-terms/#terms-relation">Dublin Core specification</a></cite> specifies: "Recommended best practice is to identify the related resource by means of a string conforming to a formal identification system". <strong>NOTE:</strong> this relation is reified so that we can make meta-statements about it. See <a href="http://stackoverflow.com/questions/5671227/ddg#5671407">stackoverflow.com/questions/5671227/ddg#5671407</a>. This might not follow dcterms best practices. |
| psi:zaaknummer | Array of strings (literal) | Existing case numbers. |
| dcterms:subject | Array of objects (resource) | What kind of law this case is about (e.g., 'civil law'). |
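
The rule for `dcterms:abstract` (abstracts that consist of a single dash are left out) could be applied during conversion roughly as sketched below. The helper names are hypothetical, and treating an empty abstract the same way as a dash is an assumption of this sketch rather than documented behaviour.

```python
def clean_abstract(raw):
    """Return the abstract text, or None when it is uninformative."""
    if raw is None:
        return None
    text = raw.strip()
    # A lone dash carries no information; empty text is treated alike (assumption).
    if text in ("", "-"):
        return None
    return text


def with_abstract(metadata, raw_abstract):
    """Return a copy of `metadata` with dcterms:abstract set only if informative."""
    result = {k: v for k, v in metadata.items() if k != "dcterms:abstract"}
    abstract = clean_abstract(raw_abstract)
    if abstract is not None:
        result["dcterms:abstract"] = abstract
    return result
```

Omitting the field entirely, rather than storing a placeholder, means consumers can test for its presence instead of checking for special values.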

Document metadata

| Tag name / JSON field | JSON value | Description |
|---|---|---|
| dcterms:issued | | HTML publication date in YYYY-MM-DD. |
| dcterms:modified | | Document modified. |
| dcterms:identifier | | ECLI id suffixed with :DOC; irrelevant. |
| dcterms:format | | 'text/html'; irrelevant. |
| htmlIssued | String (YYYY-MM-DD date) | Date on which this judgment became available on the web. Comes from one of two <code>dcterms:issued</code> tags: one for the issuing of the original judgment, one for the issuing of the web page. |

Register metadata

| XML tag name / JSON field name | JSON value | Description |
|---|---|---|
| dcterms:format | String | Doctype: text/xml; this is irrelevant for us. |
| metadataModified | String (YYYY-MM-DDTh:mm:ss date) | Date on which the metadata was last modified. |
| dcterms:modified | String (YYYY-MM-DDTh:mm:ss date) | Date on which the document was last modified. |
| dcterms:issued | String | XML publication date in YYYY-MM-DD. |

Additional metadata

These additional metadata fields are generated by our server.

| JSON field name | JSON value | Description |
|---|---|---|
| @type | String (resource URI) | Fixed to <code>frbr:LegalWork</code>. |
| markedUpByRechtspraak | Boolean | Whether this document has rich markup, or consists only of <code><para></code> and <code><paragroup></code> elements. |
| owl:sameas | String (resource URI) | Deeplink to HTML manifestation of this document on <a href="http://www.rechtspraak.nl/">Rechtspraak.nl</a>. |
| tokens | Array of arrays of strings | Tokenized version of judgment text with all XML tags stripped. Stemmed term count is implemented as a <a href="#term-frequency">MapReduce job</a>. |
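
To illustrate how the `tokens` field lends itself to a term count, here is a map/reduce-style sketch in plain Python. It is not the view defined in the database: it lowercases tokens instead of stemming them, and the function names are made up for this example.

```python
from collections import Counter


def map_terms(doc):
    """Map step: emit a (term, 1) pair for every token in the document."""
    # `tokens` is an array of arrays of strings; a missing field emits nothing.
    for inner in doc.get("tokens", []):
        for token in inner:
            # Lowercasing stands in for stemming in this sketch.
            yield token.lower(), 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts per term."""
    counts = Counter()
    for term, n in pairs:
        counts[term] += n
    return counts


def term_frequency(docs):
    """Run the map step over all documents and reduce the result."""
    return reduce_counts(pair for doc in docs for pair in map_terms(doc))
```

Because the tokens are stored with the document, a count like this needs no XML parsing at query time.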

Views

A number of secondary views are defined on the data set.

--TODO table

dutch-case-law-to-couchdb's People

Contributors

digitalheir
