
lobid-gnd's Introduction

About

lobid-gnd: access GND+EntityFacts data as JSON-LD over HTTP.

Setup

Prerequisites

sbt 0.13 or newer — download sbt

Elasticsearch 5.6.x (configured in application.conf)

Build

Get the code, change into the project directory, and run the tests:

git clone https://github.com/hbz/lobid-gnd.git ; cd lobid-gnd ; sbt test

Data

There are three data sources involved:

Entity Facts (JSON-LD over HTTP), GND baseline (RDF-XML over HTTP), and GND updates (RDF-XML over OAI-PMH).

Entity Facts

Set up a location for the Entity Facts input data:

mkdir entityfacts ; cd entityfacts

Get the latest Entity Facts data from the DNB (see https://data.dnb.de/opendata/):

wget https://data.dnb.de/opendata/authorities_entityfacts.jsonld.gz

Unpack the data:

gunzip < authorities_entityfacts.jsonld.gz > authorities_entityfacts.jsonld

Go back to the root directory:

cd ..

Index the data, passing the index name:

sbt -Dindex.entityfacts.index=entityfacts_20210120 "runMain apps.Index entityfacts"

For configuration details and defaults, see ‘conf/application.conf’.

GND Baseline

Get the RDF data

Set up a location for the input data:

mkdir input_data; cd input_data

Set ‘data.rdfxml’ in ‘conf/application.conf’ to the ‘input_data’ location.

Get the GND RDF/XML source data from https://data.dnb.de/opendata/:

wget https://data.dnb.de/opendata/authorities-{geografikum,koerperschaft,kongress,person,sachbegriff,werk}_lds.rdf.gz

This should give you 6 local files ending with ‘.rdf.gz’. Go back to the project root directory:

cd ..

Convert RDF/XML to JSON

Set up a location for the index data:

mkdir index_data

Set ‘data.jsonlines’ in ‘conf/application.conf’ to the ‘index_data’ location.

Set ‘index.boot’ in ‘conf/application.conf’ to an existing index. This index will be used to get labels during the conversion process.

Set ‘index.prod’ in ‘conf/application.conf’ to a non-existing index. This index name will be used in the indexing data created during conversion.

Convert the data to JSON-LD lines, the index data format:

sbt "runMain apps.ConvertBaseline"

To be able to log out from the server while the conversion is running, we actually use (see full usage details in baseline.sh):

setsid nohup sbt "runMain apps.ConvertBaseline" &

This should create 6 ‘*.jsonl’ files in ‘index_data’.

Index the JSON data

If the ‘index.prod’ configured in ‘application.conf’ does not exist, a new index will be created.

To start the indexing, run:

sbt "runMain apps.Index baseline"

Updates

Get and convert the updates

Updates are pulled via the DNB OAI-PMH interface.

Pass one or two arguments: get updates since (and optionally until) a given date:

sbt "runMain apps.ConvertUpdates 2022-06-22 2022-06-23"

The date of the most recent update is stored in ‘GND-lastSuccessfulUpdate.txt’ (can be changed in the config).
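If a wrapper script wanted to pass the stored date explicitly, the call could be assembled roughly as follows. This is only a sketch: it assumes the file holds a single ISO date such as 2022-06-22, which the text above does not confirm, and the function name is hypothetical.

```python
from datetime import date
from typing import List, Optional


def convert_updates_command(last_update: str, until: Optional[date] = None) -> List[str]:
    """Build the sbt call for updates since last_update (optionally until a date)."""
    # Assumption: last_update is the text content of GND-lastSuccessfulUpdate.txt
    # and contains one ISO date; fromisoformat raises ValueError otherwise.
    since = date.fromisoformat(last_update.strip())
    dates = [since.isoformat()]
    if until is not None:
        dates.append(until.isoformat())
    # The runMain call is passed to sbt as a single argument, like the quoted form above.
    return ["sbt", "runMain apps.ConvertUpdates " + " ".join(dates)]


print(convert_updates_command("2022-06-22\n", date(2022, 6, 23)))
```

The returned list can be handed to subprocess.run without going through a shell.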

The original downloaded data and the converted data are stored in separate files. To convert the data again without downloading it, use the steps described above under ‘Convert RDF/XML to JSON’ with the update RDF data.

Index the updates

To index the updates run:

sbt "runMain apps.Index updates"

See ‘application.conf’ for details on the configured file names etc.

Web

In ‘lobid-gnd’, run the web application:

sbt run

Open http://localhost:9000/gnd

Eclipse

To set up an Eclipse project, first generate the Eclipse config for your machine:

sbt "eclipse with-source=true"

Then import the project in Eclipse: “File” > “Import” > “Existing Projects into Workspace”.

lobid-gnd's People

Contributors

acka47, christophewertowski, dr0i, fsteeg, librerli, phu2, rettinghaus


lobid-gnd's Issues

Use general preferredName / variantName properties

From #1:

We don't want specific name properties like preferredNameForThePlaceOrGeographicName and variantNameForThePlaceOrGeographicName. For all entities, we should just use preferredName and variantName.

This will allow us to query the whole data in a uniform way. (The type is made clear by other means so that we don't need the specific properties.)
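A minimal sketch of the renaming, assuming the specific keys all follow the pattern preferredNameForThe... / variantNameForThe... seen in the example above. The function name is hypothetical and this is not the project's actual conversion code.

```python
import re

# Matches keys such as "preferredNameForThePlaceOrGeographicName".
SPECIFIC_NAME = re.compile(r"^(preferredName|variantName)ForThe\w+$")


def generalize_name_keys(record: dict) -> dict:
    """Return a copy of record with type-specific name keys collapsed."""
    result = {}
    for key, value in record.items():
        match = SPECIFIC_NAME.match(key)
        # Assumption: a record carries at most one specific key per kind,
        # so collisions between renamed keys are not handled here.
        result[match.group(1) if match else key] = value
    return result


place = {
    "preferredNameForThePlaceOrGeographicName": "London",
    "variantNameForThePlaceOrGeographicName": ["Londres"],
}
print(generalize_name_keys(place))
```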

Test serialization of GND RDF-XML to compact JSON-LD

Both dumps and updates (via OAI) are available as RDF-XML, so that would be a suitable source format:

http://datendienst.dnb.de/cgi-bin/mabit.pl?userID=opendata&pass=opendata&cmd=login
http://www.dnb.de/DE/Service/DigitaleDienste/OAI/oai_node.html (see "Formate")

We should test serializing that RDF-XML as compact JSON-LD using the entityfacts context:

http://hub.culturegraph.org/entityfacts/context/v1/entityfacts.jsonld
http://hub.culturegraph.org/entityfacts/118540238

If the result looks good, this might be the format to index in Elasticsearch. We might have to do some preprocessing to make sure the values always have the same type (see footnote 1 in http://blog.lobid.org/2017/06/08/lobid-api-why-how.html about compact JSON-LD serialization in Elasticsearch).
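The preprocessing mentioned above could look roughly like this: compact JSON-LD emits a single value as a scalar and several values as an array, so selected fields are always wrapped in a list before indexing. The field list and function name are assumptions for illustration.

```python
# Hypothetical list of fields that should always be indexed as arrays.
ARRAY_FIELDS = frozenset({"variantName", "type", "sameAs"})


def normalize_value_types(record: dict, array_fields=ARRAY_FIELDS) -> dict:
    """Wrap scalar values of the given fields in a list; leave the rest alone."""
    normalized = {}
    for key, value in record.items():
        if key in array_fields and not isinstance(value, list):
            normalized[key] = [value]
        else:
            normalized[key] = value
    return normalized


print(normalize_value_types({"preferredName": "London", "variantName": "Londres"}))
```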

Create & publish JSON-LD context that includes keys for classes

Moving from old lobid-gnd repo, issue by @acka47.

We will use short keys instead of the URIs for classes/types in GND API 2.0. Thus, we will need these keys to be defined in the @context. Ideally, we will generate this (half-)automatically from the GND ontology.

We already did some half-automatic generation of the GND context, see lobid/lodmill#251 & https://gist.github.com/niklasl/2770154#gistcomment-946311.

I once again used Niklas' Python tool and created the JSON-LD context. Afterwards, some manual post-processing was necessary, as well as the addition of owl:sameAs and foaf:page and the deletion of deprecated properties and classes. The result can be found at https://gist.github.com/acka47/98035a3f215c783bdc00.

Leaving this open (and renaming the ticket) until the context is published.

Enable filtering based on GND ontology types

Moved here with some adjustments from lobid/lodmill#447.

Example query: "heinsberg"

Facets/possibilities for filtering by second-level types:

  • Person
  • Subject Heading
  • Place or Geographic Name
  • Conference or Event
  • Family
  • Corporate Body
  • Work

(Similar to e.g. https://portal.dnb.de/opac.htm?method=simpleSearch&query=heinsberg)

Clicking a facet should yield filtered search results with sub-facets, e.g.:

Place or Geographic Name:

  • Building or Memorial
  • Administrative Unit
  • Fictive Place
  • ...

Problems in the GND RDF

Improve (sub-)type filtering

Currently, we only allow filtering by first-level types Person, CorporateBody, ConferenceOrEvent, Work, PlaceOrGeographicName, SubjectHeading, Family.

One adjustment should be that the filter section changes when a type is clicked: either the first-level types that were not clicked should disappear (and the sub-types should be shown, see below), or they should be greyed out.

We should also enable filtering by the subtypes in these two ways:

  1. Add a "+" sign to first-level types that have sub-types so that people can click on it to show the sub-types (e.g. for "Place or Geographic Name": "Building or Memorial", "Administrative Unit", "Fictive Place" etc.).
  2. Filtering by a first-level type should yield filtered search results with sub-facets opening up.

Use information from EntityFacts

Documentation: http://www.dnb.de/DE/Service/DigitaleDienste/EntityFacts/entityfacts_node.html

Example of sameAs links for London:

{
  "sameAs":[
    {
      "@id":"http://d-nb.info/gnd/4074335-4/about",
      "collection":{
        "abbr":"DNB",
        "name":"Gemeinsame Normdatei (GND) im Katalog der Deutschen Nationalbibliothek",
        "publisher":"Deutsche Nationalbibliothek",
        "icon":"http://www.dnb.de/SiteGlobals/StyleBundles/Bilder/favicon.png?__blob=normal&v=1"
      }
    },
    {
      "@id":"http://viaf.org/viaf/236493943",
      "collection":{
        "abbr":"VIAF",
        "name":"Virtual International Authority File (VIAF)",
        "publisher":"OCLC",
        "icon":"http://viaf.org/viaf/images/viaf.ico"
      }
    },
    {
      "@id":"http://www.wikidata.org/entity/Q6669738",
      "collection":{
        "abbr":"WIKIDATA",
        "name":"Wikidata",
        "publisher":"Wikimedia Foundation Inc.",
        "icon":"https://www.wikidata.org/static/favicon/wikidata.ico"
      }
    },
    {
      "@id":"https://de.wikisource.org/wiki/London",
      "collection":{
        "abbr":"WIKISOURCE",
        "name":"Wikisource",
        "publisher":"Wikimedia Foundation Inc.",
        "icon":"https://wikisource.org/static/favicon/wikisource.ico"
      }
    },
    {
      "@id":"https://en.wikipedia.org/wiki/London%2C_Wisconsin",
      "collection":{
        "abbr":"enwiki",
        "name":"Wikipedia (English)",
        "publisher":"Wikimedia Foundation Inc.",
        "icon":"https://en.wikipedia.org/static/favicon/wikipedia.ico"
      }
    }
  ]
}

First step: Use EntityFacts to show links (using the icons linked in the JSON) to other resources in the HTML interface.

Possible further step: Enrich JSON-LD with links from EntityFacts. There might be a dump in the future we could build a map from, see https://twitter.com/junicatalo/status/971793109796433926.
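For the first step, the link data could be pulled out of an EntityFacts record roughly as follows. This is a sketch based only on the structure of the sameAs example above; the function name and the shape of the returned dictionaries are assumptions.

```python
def same_as_links(entity_facts: dict) -> list:
    """Collect URL, label and icon for each sameAs entry of an EntityFacts record."""
    links = []
    for entry in entity_facts.get("sameAs", []):
        collection = entry.get("collection", {})
        links.append({
            "url": entry["@id"],
            # Fall back to the URL if the collection has no name.
            "label": collection.get("name", entry["@id"]),
            "icon": collection.get("icon"),
        })
    return links


london = {
    "sameAs": [
        {
            "@id": "http://viaf.org/viaf/236493943",
            "collection": {
                "abbr": "VIAF",
                "name": "Virtual International Authority File (VIAF)",
                "icon": "http://viaf.org/viaf/images/viaf.ico",
            },
        }
    ]
}
for link in same_as_links(london):
    print(link["label"], link["url"], link["icon"])
```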

Adjust context for GND release 2018.02

The GND release 2018.02 on 2018-05-15 will bring some additional sub-classes with it. From the announcement (translated from German):

3. Changes in the GND ontology

The class “gndo:CorporateBody” gets four new subclasses:
“gndo:Company”, “gndo:MusicalCorporateBody”, “gndo:ReligiousCorporateBody”
and “gndo:ReligiousAdministrativeUnit”.

4. Changes in the GND conversion

Four further subtypes of corporate bodies (“gndo:CorporateBody”)
are explicitly marked as such with rdf:type. These are the
classes “gndo:Company”, “gndo:MusicalCorporateBody”,
“gndo:ReligiousCorporateBody” and “gndo:ReligiousAdministrativeUnit”. Each of them
is a subclass (rdfs:subClassOf) of “gndo:CorporateBody”, see section 3
on the GND ontology. As usual in this case, no additional explicit
assignment to the superclass “gndo:CorporateBody” is made.

Examples in the test file:
http://d-nb.info/gnd/6518137-2, http://d-nb.info/gnd/111260619X,
http://d-nb.info/gnd/2001043-6
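A sketch of the context entries the four new classes would need, assuming each short key maps to the class URI in the GND ontology namespace. The namespace URI below is an assumption and should be checked against the published ontology.

```python
import json

# Assumed namespace of the GND ontology (the "gndo:" prefix).
GNDO = "http://d-nb.info/standards/elementset/gnd#"

# The four subclasses named in the announcement.
NEW_CLASSES = [
    "Company",
    "MusicalCorporateBody",
    "ReligiousCorporateBody",
    "ReligiousAdministrativeUnit",
]


def context_entries(class_names, namespace=GNDO):
    """Map each short key to its full class URI for use in a JSON-LD @context."""
    return {name: namespace + name for name in class_names}


print(json.dumps({"@context": context_entries(NEW_CLASSES)}, indent=2))
```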

Add autocomplete support

Copied from lobid/lodmill#468, originally opened by @nichtich. We should consider this for the new implementation of GND lookup. (I think, @jschnasse also mentioned this.)

For simple lookup it would be nice to add format=suggest in OpenSearch Suggestions format. Format short returns a plain array of strings, e.g. http://api.lobid.org/person?name=Marx&format=short

[
    "Marx, Antônio Augusto (1919-)",
    "Marx, J. A.",
    "Marxsen, Peter Christian (1806-1869)",
    ...
]

OpenSearch Suggestions would be:

[
   "Marx",
   [
       "Marx, Antônio Augusto",
       "Marx, J. A.",
       "Marxsen, Peter Christian",
       ...
   ],
   [
       "Architekt und Künstler (1919-)",
       "Organist und Musiker in Yselstein",
       "Subrektor (1806-1869)",
       ...
   ],
   [
       "http://d-nb.info/gnd/1030092001",
       "http://d-nb.info/gnd/1012540626",
       "http://d-nb.info/gnd/1016221088",
       ...
   ]
]
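Building that four-element array from search hits is straightforward. A minimal sketch, assuming each hit is already available as a (name, description, URL) tuple; the function name and the tuple layout are hypothetical.

```python
import json


def opensearch_suggestions(query, hits):
    """Build an OpenSearch Suggestions array from (name, description, url) hits."""
    names = [name for name, _, _ in hits]
    descriptions = [description for _, description, _ in hits]
    urls = [url for _, _, url in hits]
    # Order is fixed: query string, completions, descriptions, query URLs.
    return [query, names, descriptions, urls]


hits = [
    ("Marx, Antônio Augusto", "Architekt und Künstler (1919-)", "http://d-nb.info/gnd/1030092001"),
    ("Marx, J. A.", "Organist und Musiker in Yselstein", "http://d-nb.info/gnd/1012540626"),
]
print(json.dumps(opensearch_suggestions("Marx", hits), ensure_ascii=False, indent=2))
```

The three inner arrays stay aligned by position, so the n-th description and URL belong to the n-th completion.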

Automate data transformation for production

We have to automate our data transformation setup for production mode:

  • Baseline transformation and indexing, log errors, monitor log
  • Update transformation, indexing, logging, monitoring. See also commit comment for 00ca2a6

Both could be triggered via curl POSTs from a cron job (like in lobid-organisations). Calling the transformation from within the web app would result in logging to the application.log (currently calling classes with main methods manually for transformations).

  • Set up routes and controllers for triggering baseline and update transformations
  • Set up dates to pass to the updates transformation, see also commit comment for 00ca2a6
  • Set up cron jobs for triggering the baseline and updates transformation

Server side setup should probably wait until https://github.com/hbz/lobid-webserver/issues/5 is done.

Provide more filter options

We discussed which filter options we could add. I identified four additional filters that could be useful, two of them only covering persons:

For differentiated persons which account for ~31% of the entries, there are two more:

Design detail view

  • Think about the labels for the different properties (@acka47)
  • Add map view to PlaceOrGeographicName if easy
  • General design and layout

Design search result list

This probably has #24 (plus the addition of other labels) as prerequisite.

Currently, the search result list shows name, some variant names and the GND ID with a link to the entry at the DNB. With this ticket we should discuss and implement a more useful result list.

Here are some ideas:

  • Show neither the GND ID nor a link to the DNB service. It suffices to have this in the detail view.
  • For Person, show occupation (professionOrOccupation) and birth/death year.
  • For ConferenceOrEvent, show place (placeOfConferenceOrEvent) and date (dateOfConferenceOrEvent). (May not be necessary as this is also appended in parentheses to the preferred name.)
  • For CorporateBody, show placeOfBusiness (more than half of the 1.4 million corporate bodies have that info).
  • For Work, we definitely need firstAuthor/firstComposer.
  • For PlaceOrGeographicName, show geographicAreaCode.
  • For SubjectHeading, show gndSubjectCategory.

Don't use an array for "preferredName"

As resources only have one preferred name, an array isn't needed and might be confusing. Also, we don't have an array for label in lobid-resources, so this would be consistent with our other services.

Add profession to auto-suggest labels

Copied from lobid/lodmill#75.

To improve disambiguation during auto-suggest selection, we should add the profession to auto-suggest labels on the /person endpoint. The information is available through the GND links, but we have to resolve the labels.

Add 2nd level GND typing

From #1 (comment):

Furthermore, we should have a type from the second level of the GND ontology attached to each resource. We will need this for faceting. The GND ontology has three levels in its type hierarchy (except for Person, where we have a fourth one added). See the overview of the GND class hierarchy at https://wiki1.hbz-nrw.de/x/CIeW. In the concrete example, PlaceOrGeographicName should be in the data.
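One way to derive the second-level type is to walk up a child-to-parent map until the parent is the root class. The map below is only an illustrative excerpt, and the root name AuthorityResource is an assumption; the real hierarchy would come from the GND ontology.

```python
# Illustrative excerpt of the class hierarchy: child class -> parent class.
PARENT = {
    "Person": "AuthorityResource",
    "CorporateBody": "AuthorityResource",
    "PlaceOrGeographicName": "AuthorityResource",
    "Company": "CorporateBody",
    "BuildingOrMemorial": "PlaceOrGeographicName",
}
ROOT = "AuthorityResource"  # assumed name of the top-level class


def second_level_type(gnd_type, parent=PARENT, root=ROOT):
    """Return the ancestor directly below the root, or None if it is unknown."""
    current = gnd_type
    # Climb until the current class is a direct child of the root.
    while current in parent and parent[current] != root:
        current = parent[current]
    return current if parent.get(current) == root else None


print(second_level_type("BuildingOrMemorial"))
```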

Add labels to ID fields

For example http://test.lobid.org/authorities/118506560.

Currently looks like:

{
"professionOrOccupation": [
  "http://d-nb.info/gnd/4131406-2",
  "http://d-nb.info/gnd/4012434-4"
  ]
}

With added labels:

{
   "professionOrOccupation":[
      {
         "id":"http://d-nb.info/gnd/4131406-2",
         "preferredName":"Pianist"
      },
      {
         "id":"http://d-nb.info/gnd/4012434-4",
         "preferredName":"Dirigent"
      }
   ]
}

However, there are several other cases (e.g. placeOfDeath, relatedWork, familialRelationship) where we only have internal GND links without labels in the data. Thus, a general approach would probably make sense to automatically add a label for all URIs from the http://d-nb.info/gnd/ namespace. (If we implement a general approach, it should be possible to exclude some properties from this label enrichment, e.g. we might get problems with trying to fetch the label for deprecatedUri and also labels for sameAs links within GND aren't necessary I think.) What do you think, @fsteeg?
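The general approach could be sketched like this: every string value in the http://d-nb.info/gnd/ namespace is replaced by an object with id and preferredName, unless the property is excluded. The labels dictionary stands in for whatever lookup is used (for example an existing index); it and the function name are assumptions.

```python
GND_PREFIX = "http://d-nb.info/gnd/"
# Properties that should keep plain URIs, as suggested above.
EXCLUDED = frozenset({"deprecatedUri", "sameAs"})


def add_labels(record: dict, labels: dict, excluded=EXCLUDED) -> dict:
    """Replace GND URIs in array fields with {id, preferredName} objects."""
    enriched = {}
    for key, value in record.items():
        if key in excluded or not isinstance(value, list):
            enriched[key] = value
            continue
        enriched[key] = [
            {"id": item, "preferredName": labels[item]}
            if isinstance(item, str) and item.startswith(GND_PREFIX) and item in labels
            else item  # unknown or non-GND values are left unchanged
            for item in value
        ]
    return enriched


labels = {
    "http://d-nb.info/gnd/4131406-2": "Pianist",
    "http://d-nb.info/gnd/4012434-4": "Dirigent",
}
record = {
    "professionOrOccupation": [
        "http://d-nb.info/gnd/4131406-2",
        "http://d-nb.info/gnd/4012434-4",
    ]
}
print(add_labels(record, labels))
```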

Add links from GND entities to lobid-resources

If an authority resource is covered in lobid-resources either in the contribution array or in the subject array, we should show links underneath its entry titled:

  • "Zeige Titel von $Agent" ("Show titles by $Agent")
  • "Zeige Titel über $Entity" ("Show titles about $Entity")

Clicking on the link will open another browser tab/window with the respective result list.

Add new properties to @context for GND release 2018.01

With the new release on 2018-01-16, there will be some new properties used in GND data, see the announcement "Änderungen im Format RDF ab 16. Januar 2018 (Export-Release 2018.01)" (changes to the RDF format from 16 January 2018), pages 1-3. We will have to update the JSON-LD context to accommodate those changes.

Four abbreviatedName properties:

  • gndo:abbreviatedNameForTheConferenceOrEvent
  • gndo:variantNameForTheConferenceOrEvent
  • gndo:abbreviatedNameForTheWork
  • gndo:abbreviatedNameForThePlaceOrGeographicName

Twelve properties from AgRelOn:

  • agrelon:hasParent
  • agrelon:hasChild
  • agrelon:hasSpouse
  • agrelon:hasSibling
  • agrelon:hasGrandParent
  • agrelon:hasGrandChild
  • agrelon:hasNieceNephew
  • agrelon:hasAuntUncle
  • agrelon:hasFriend
  • agrelon:hasColleague
  • agrelon:hasTeacher
  • agrelon:hasStudent

Add dateOfModification of norm data files

We should add the date of our latest pull from the GND, so that users don't have to pull all GND files.

Even if the DNB adds the modification date from PICA it wouldn't be optimal as it's changed even if the postbox is used and nothing changes on the norm data itself.

Filtering shouldn't alter ranking

For example:

  1. http://test.lobid.org/gnd/search?q=* -> The first 10 hits are all persons starting with "G"
  2. Filter by Person -> Not only are the other entities filtered out but the ranking is changed. Now persons with "Person" in their name are shown on top of the list.

Probably the best way to handle this would be to use Elasticsearch filters instead of queries for faceting, see https://www.elastic.co/guide/en/elasticsearch/reference/current/query-filter-context.html (as we do in OER World Map).

BTW, we do it like this in all our services...
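In Elasticsearch terms, this means putting the type restriction into the filter clause of a bool query, which does not contribute to scoring, while the user's query stays in must. A sketch of such a request body; the field name "type" and the function name are assumptions about the index mapping.

```python
import json


def search_body(q, type_filter=None):
    """Build a search body where the type restriction does not affect scoring."""
    bool_query = {"must": [{"query_string": {"query": q}}]}
    if type_filter:
        # Filter context: documents are included or excluded, never scored.
        bool_query["filter"] = [{"term": {"type": type_filter}}]
    return {"query": {"bool": bool_query}}


print(json.dumps(search_body("*", "Person"), indent=2))
```

With this shape, the relative order of the remaining hits is still determined by the query part alone.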

Rename to lobid-gnd

We decided to rename the service to lobid-gnd.

  • Rename git repo
  • Rename Play project
  • Change URIs in transformation
  • Change HTTP routes
  • Change in UI
  • Update proxy settings

Remove blank node ids

From #1 (comment):

Re. the framing output from http://tinyurl.com/ychm4t92, I just noticed that blank nodes get an id:

     "hasGeometry": {
       "@id": "_:b0",
       "@type": "http://www.opengis.net/ont/sf#Point",
       "asWKT": "Point ( -000.125740 +051.508530 )"
     }

We should get rid of them. This has already been addressed in the JSON-LD Framing spec 1.1 ("pruneBlankNodeIdentifiers") but is currently only implemented in the Ruby library, see json-ld/json-ld.org#293.
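Until the framing option is available, a post-processing step could drop the identifiers. The sketch below is a simplification: it removes every blank node id, whereas pruneBlankNodeIdentifiers is meant to keep those that are referenced more than once. The function name is hypothetical.

```python
def strip_blank_node_ids(node):
    """Recursively remove "@id"/"id" entries whose value is a blank node label."""
    if isinstance(node, dict):
        return {
            key: strip_blank_node_ids(value)
            for key, value in node.items()
            if not (key in ("@id", "id") and isinstance(value, str) and value.startswith("_:"))
        }
    if isinstance(node, list):
        return [strip_blank_node_ids(item) for item in node]
    return node


framed = {
    "hasGeometry": {
        "@id": "_:b0",
        "@type": "http://www.opengis.net/ont/sf#Point",
        "asWKT": "Point ( -000.125740 +051.508530 )",
    }
}
print(strip_blank_node_ids(framed))
```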

Hide variant names in detail view

When there are lots of variant names it doesn't look very good, e.g. https://test.lobid.org/gnd/1083009273

If we show the field at all we should hide it by default like so:

Gelin-xiongdi; Grimmů, Bratří; Grim, Broliai; Grimm, Frers; Grimm, Brüder; Grimm, Irmãos; Grimm, Bröderna; Grimm, Brothers; Grimm, Fratelli; Grimm, Bratraj; Grimai; Grîm, hā-Āḥîm; Grimmovci; Gkrim, Adelphōn; Grimm, Gebroeders; Grimm, Bratří; Grimm, Braci; Grim, Ve͏̈llez͏̈rit; Grimm, Bracia; Kardešler Grimm; Breudeur Grimm; Grimm, Frères; Grimm, Brdr.; Grimm, Braty; Grimm, Hermanos; Grimm, Frars; Gurimu; Gebrüder Grimm; Grimm, Brat'ja; Grimm, Brieder; Grim, Braḱata; Grimm, Germans; Adelphōn Gkrim; Grimmové; Gha-rim; Grīm, Barādarān-i; Grimm, G.; Grin; Ki-lwen; Braḱata Grim; Grimm, Bratʹev; Kŭrim, Hyŏngje; Grimmu, Brāļu; Kŭrīm, hyŏngje; Grimm, Braća; Grim, Bratja; Braci Grimm; Grim, Brider; Grimm, Gebrüder; Grimm, Frații; Grimm, Brata; Gkrimm, Adelphoi; Ghi-rim; Grimm, Brødrene; Grimm, Kardešler; Gurimu, Kyōdai; Grim Bandhu; Ge lin xiong di; Grimmin, Veljekset; Gimrm, Jcoab; Grimm, Bratray

Make numbers easier to read

For the result and facet counts, e.g. "4.577.743" instead of "4577743". Keep in mind that we will implement this differently for the – to be added – English version of the service.
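A small sketch of the grouping, with a dot as separator for the German interface and a comma for a possible English one. The language parameter and function name are assumptions; a real implementation would more likely rely on the framework's locale-aware formatting.

```python
def format_count(n: int, lang: str = "de") -> str:
    """Group digits in threes: dots for German, commas otherwise."""
    grouped = f"{n:,}"  # e.g. 4,577,743
    return grouped.replace(",", ".") if lang == "de" else grouped


print(format_count(4577743))
print(format_count(4577743, "en"))
```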
