This repository contains code that clones Dutch case law XML documents from data.rechtspraak.nl and converts the metadata to JSON, which makes the documents easier to process by machine. Because the documents are stored in a CouchDB database, the data is immediately amenable to MapReduce jobs.
Everything described here is a derivative of the rechtspraak.nl web service. Original documentation is available in Dutch here.
Use this API to get Dutch case law documents. Documents are available in XML and HTML.
XML documents are retrieved using the URL scheme https://rechtspraak.cloudant.com/docs/{ECLI identifier}/data.xml.
Example.
HTML snippets are retrieved using the URL scheme https://rechtspraak.cloudant.com/docs/{ECLI identifier}/data.htm.
Example.
Document metadata is retrieved using the URL scheme https://rechtspraak.cloudant.com/docs/{ECLI identifier}.
Example.
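As a minimal sketch of the three URL schemes above, the following Python helper builds the metadata, XML and HTML URLs for a single ECLI identifier. The function name `document_urls` and the identifier `ECLI:NL:HR:2014:1234` are placeholders chosen for illustration; the identifier may not correspond to a real judgment.

```python
from urllib.parse import quote

BASE_URL = "https://rechtspraak.cloudant.com/docs"


def document_urls(ecli: str) -> dict:
    """Return the metadata, XML and HTML URLs for one ECLI identifier."""
    # ECLI identifiers contain colons, which may stay unescaped in a URL path.
    doc_id = quote(ecli, safe=":")
    return {
        "metadata": f"{BASE_URL}/{doc_id}",
        "xml": f"{BASE_URL}/{doc_id}/data.xml",
        "html": f"{BASE_URL}/{doc_id}/data.htm",
    }


# Placeholder identifier, used only to show the shape of the URLs.
urls = document_urls("ECLI:NL:HR:2014:1234")
for kind, url in urls.items():
    print(kind, url)
```

Any HTTP client can then fetch these URLs, for example `urllib.request.urlopen(urls["metadata"])`, provided the service is reachable.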
Rechtspraak.nl contains metadata for judgments in XML/RDF, but the RDF is actually not well formed. XML is also arguably a bad format for metadata. In any case, CouchDB describes documents in JSON, so metadata is converted to JSON. In order to be RDF-compatible, we adhere to the JSON-LD format.
In addition, some extra metadata fields are generated, and a number of MapReduce jobs are defined on the data.
The cloning process attempts to correct metadata issues. Document metadata is described in JSON format, respecting RDF triples through JSON-LD.
In the following table, all metadata fields are presented, and some guarantees are made about their JSON structure. We make some assumptions about the RDF triples that Rechtspraak.nl provides that are not strictly necessary, but make the data easier to work with. Also, some values merit some extra processing in order to keep our RDF consistent.
Some fields are uncommon, so for each field we provide a link to an example document which uses that field. Visit ecli/_design/query_dev/_view/docs_with_field?group_level=2&startkey=["<field name>"]&endkey=["<field name>\ufff0"] to see which documents contain that particular field.
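The field lookup can be sketched in Python as follows. This assumes the view lives under https://rechtspraak.cloudant.com/ as in the links in the table, and that keys are JSON arrays starting with the field name; the helper name `docs_with_field_url` is hypothetical. The `\ufff0` suffix on the end key is the usual CouchDB idiom for matching every key that starts with the given prefix.

```python
import json
from urllib.parse import urlencode

VIEW_URL = (
    "https://rechtspraak.cloudant.com/ecli/_design/query_dev/_view/docs_with_field"
)


def docs_with_field_url(field: str, group_level: int = 2) -> str:
    """Build a view query URL that lists documents containing `field`."""
    params = {
        "group_level": group_level,
        # View keys are JSON values, so they are serialized before URL-encoding.
        "startkey": json.dumps([field]),
        # U+FFF0 sorts after ordinary characters, closing the key range.
        "endkey": json.dumps([field + "\ufff0"]),
    }
    return f"{VIEW_URL}?{urlencode(params)}"


url = docs_with_field_url("dcterms:abstract")
print(url)
```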
These fields may appear either in the block for document metadata or the block for register metadata, and we assume they are the same in both.
Tag name / JSON field | JSON value | Description |
---|---|---|
dcterms:accessRights | String (relative URI) | Fixed to 'public'. Some manifestations may be non-public, like ones with their names unredacted, but we don't have access to those. |
dcterms:publisher | Object (resource) | Court. Assumed to be a single object. |
dcterms:title | String (literal) | Document title. Most often, this is a concatenation of the ECLI number with the court name and date. |
dcterms:language | String (resource URI) | Fixed to 'nl'. |
dcterms:abstract | String (literal) | Short summary. We do not include abstracts that consist of a single dash, because they are uninformative. |
dcterms:replaces | String (literal) | LJN number which this ECLI replaces. |
dcterms:isReplacedBy | String (literal) | If the current ECLI is not valid, this points to a replacement ECLI. Note this is only about the identifier. Doesn't seem to be used in practice. |
dcterms:contributor | Array of objects | Supposedly denotes the judge. We would like to extend this to also include other entities such as lawyers. We may reify these links to denote the roles these people have in the case. <a href="https://rechtspraak.cloudant.com/ecli/_design/query_dev/_view/docs_with_field?stale=ok&limit=100&group_level=2&startkey=[%22dcterms:contributor%22]&endkey=[%22dcterms:contributor\ufff0%22]">Doesn't seem to be used in practice.</a> |
dcterms:date | String (literal) | Date of judgment. |
dcterms:alternative | Array of strings (literal) | Aliases / alternative titles. <a href="https://rechtspraak.cloudant.com/ecli/_design/query_dev/_view/docs_with_field?stale=ok&limit=100&group_level=1&startkey=[%22dcterms:alternative%22]&endkey=[%22dcterms:alternative\ufff0%22]">Doesn't seem to be used in practice.</a> |
psi:procedure | List of objects (resources) | What kind of procedure this case is (e.g., 'appeal'). Rechtspraak.nl XML assigns the label 'Procedure' to this tag using a <code>rdfs:label</code> predicate. To fully represent this in RDF, we should reify this triple. But to keep our document readable, we assign a JSON-LD alias from <code>Procedure</code> to <code>psi:procedure</code> in <code>@context</code>. |
dcterms:creator | Object (resource) | Court in which this judgment was made. Note that we assume a cardinality of 1: behaviour is not defined for multiple <code>dcterms:creator</code> tags. <strong>NOTE:</strong> psi:afdeling is deprecated, so we won't parse it. |
dcterms:type | Object (resource) | Represents either 'Uitspraak' or 'Conclusie' ('judgment' or 'conclusion'). |
dcterms:temporal | Object (resource) | Indicates a timespan between which the case must be judged, which may happen for example in tax law. |
dcterms:references | Array of objects (resources) | These triples carry additional data: what <em>kind</em> of reference is this? This should be reified on the triple, but we just add a <code>referenceType</code> field to the referent object. <strong>NOTE:</strong> It was discussed whether this should reference an *expression* of a law, because it refers to the law at a particular time (usually the time of the court case). I don't resolve the expression because we can't know with full certainty to what time it refers. It's rechtspraak.nl's responsibility to get the reference right anyway. |
dcterms:coverage | Array of objects (resources) | The jurisdiction to which this judgment is relevant. |
dcterms:hasVersion | Array of objects (resources) | Where versions of this judgment can be found. Might be different expressions (e.g., edited and annotated). |
dcterms:relation | Array of objects (reified statements) | Relations to other cases. <cite><a href="http://dublincore.org/documents/dcmi-terms/#terms-relation">Dublin Core specification</a></cite> specifies: <blockquote> "Recommended best practice is to identify the related resource by means of a string conforming to a formal identification system" </blockquote> <strong>NOTE:</strong> this relation is reified so that we can make meta-statements about it. See <a href="http://stackoverflow.com/questions/5671227/ddg#5671407">stackoverflow.com/questions/5671227/ddg#5671407</a>. This might not follow dcterms best practices. |
psi:zaaknummer | Array of strings (literal) | Existing case numbers. |
dcterms:subject | Array of objects (resource) | What kind of law this case is about (e.g., 'civil law'). |
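
The JSON-LD alias for <code>psi:procedure</code> described in the table means a field can appear under its alias rather than its prefixed name. The Python sketch below shows one way a client might read a field under either name. The sample document is invented for illustration, the helper `get_field` is hypothetical, and the lookup is a simplification that does not implement full JSON-LD expansion.

```python
# Hypothetical metadata document: the values are invented for illustration
# and the real documents may be laid out differently.
doc = {
    "@context": {
        "dcterms": "http://purl.org/dc/terms/",
        "Procedure": "psi:procedure",  # alias described in the table above
    },
    "@type": "frbr:LegalWork",
    "dcterms:language": "nl",
    "Procedure": [{"@id": "http://example.org/procedure#appeal"}],
}


def get_field(document, field):
    """Return the value stored under `field`, following @context aliases."""
    if field in document:
        return document[field]
    for alias, target in document.get("@context", {}).items():
        # A context entry is either a plain string or an object with "@id".
        if isinstance(target, dict):
            target = target.get("@id")
        if target == field and alias in document:
            return document[alias]
    return None


procedures = get_field(doc, "psi:procedure")
print(procedures)
```

For anything beyond simple lookups, a full JSON-LD processor is the safer choice, since it also handles nested contexts and prefix expansion.
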
Tag name / JSON field | JSON value | Description |
---|---|---|
dcterms:issued | HTML publication date in YYYY-MM-DD | |
dcterms:modified | Document modified | |
dcterms:identifier | ECLI id suffixed with :DOC; irrelevant | |
dcterms:format | 'text/html', irrelevant | |
htmlIssued | String (YYYY-MM-DD date) | Date on which this judgment was available on the web. Comes from one of two <code>dcterms:issued</code> tags: one for the issuing of the original judgment, one for the issuing of the web page. |
Tag name / JSON field | JSON value | Description |
---|---|---|
dcterms:format | String | Doctype: text/xml; this is irrelevant for us. |
metadataModified | String (YYYY-MM-DDThh:mm:ss date) | Date on which the metadata was last modified. |
dcterms:modified | String (YYYY-MM-DDThh:mm:ss date) | Date on which the document was last modified. |
dcterms:issued | String | XML publication date in YYYY-MM-DD. |
These additional metadata fields are generated by our server.
JSON field name | JSON value | Description |
---|---|---|
@type | String (resource URI) | Fixed to <code>frbr:LegalWork</code> |
markedUpByRechtspraak | Boolean | Whether this document has rich markup, or consists only of <code>&lt;para&gt;</code> and <code>&lt;paragroup&gt;</code> elements. |
owl:sameas | String (resource URI) | Deeplink to HTML manifestation of this document on <a href="http://www.rechtspraak.nl/">Rechtspraak.nl</a> |
tokens | Array of arrays of strings | Tokenized version of judgment text with all XML tags stripped. Stemmed term count is implemented as a <a href="#term-frequency">MapReduce job</a>. |
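
To illustrate how the <code>tokens</code> field lends itself to a term count, here is a small in-memory Python sketch of the map and reduce steps. It is not the CouchDB view itself: it lowercases tokens instead of stemming them, the sample documents are invented, and the function names `map_tokens` and `reduce_counts` are hypothetical.

```python
from collections import Counter
from itertools import chain


def map_tokens(doc):
    """Map step: emit a (term, 1) pair for every token in the document."""
    for token_list in doc.get("tokens", []):
        for token in token_list:
            # Lowercasing stands in for the stemming done by the real job.
            yield token.lower(), 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts per term."""
    counts = Counter()
    for term, n in pairs:
        counts[term] += n
    return counts


# Invented sample documents with the "array of arrays of strings" shape.
docs = [
    {"tokens": [["De", "rechtbank", "oordeelt"], ["de", "vordering", "wordt", "afgewezen"]]},
    {"tokens": [["De", "Hoge", "Raad", "verwerpt", "het", "beroep"]]},
]

counts = reduce_counts(chain.from_iterable(map_tokens(d) for d in docs))
print(counts.most_common(3))
```
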
A number of secondary views are defined on the data set.
--TODO table