metis-language-normalization

This project provides a tool for normalizing language-related metadata, such as dc:language properties. The tool was designed after an investigation of how the dc:language property was being used by Europeana's data providers.

The project provides a normalization tool that is able to recognize language names in all European languages, and the codes of the ISO-639 family of standards. The tool has been designed to support integration into Metis, but has no dependencies on the Metis framework. It can be used through:

  • A Java class (provided in a .jar package)
  • A REST API (the project contains the implementation as a REST service, as well as a corresponding Client for networked Java applications)

The tool implements a matching algorithm that makes extensive use of the Languages Name Authority List (NAL) published in the European Union Open Data Portal (https://open-data.europa.eu/en/data/dataset/language).

The normalized language output of the tool can be configured for ISO-639-1 language codes (currently, the standard adopted by Europeana), ISO-639-2r, ISO-639-2t, ISO-639-3 (with limitations), or the Languages NAL.
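The configurable output can be pictured as one vocabulary entry per language that carries a code for each target. The following is a minimal sketch, not the tool's actual API: `TargetVocabulary`, `LanguageEntry` and `codeIn` are hypothetical names, only three of the listed targets are modelled, and the sample codes are illustrative.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical target vocabularies; only three of the listed targets are modelled here. */
enum TargetVocabulary { ISO_639_1, ISO_639_3, LANGUAGES_NAL }

/** One language with its code in each modelled vocabulary (assumed structure). */
final class LanguageEntry {
    private final Map<TargetVocabulary, String> codes = new EnumMap<>(TargetVocabulary.class);

    /** Pass null for a vocabulary that has no code for this language. */
    LanguageEntry(String iso6391, String iso6393, String nal) {
        if (iso6391 != null) codes.put(TargetVocabulary.ISO_639_1, iso6391);
        if (iso6393 != null) codes.put(TargetVocabulary.ISO_639_3, iso6393);
        if (nal != null) codes.put(TargetVocabulary.LANGUAGES_NAL, nal);
    }

    /** Returns the code in the requested target, or empty when the target cannot express the language. */
    Optional<String> codeIn(TargetVocabulary target) {
        return Optional.ofNullable(codes.get(target));
    }
}
```

With this shape, an entry such as `new LanguageEntry(null, "ast", "AST")` (a language without a two-letter code) returns an empty result for `ISO_639_1`, so the caller has to decide how to handle values that the chosen target vocabulary cannot represent.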

Results from the application in Europeana

In May 2016, the language normalization methods implemented in this tool were tested on the complete Europeana dataset. The following table shows how often codes from the ISO-639 family appear in dc:language values within the Europeana dataset:

| | ISO-639-1 codes | Other ISO-639 family codes | Other values | Total |
|---|---|---|---|---|
| dc:language property values in Europeana | 23,988,536 | 5,485,168 | 9,792,054 | 39,265,758 |
| (percentage of total values) | (61.1%) | (14.0%) | (24.9%) | (100.0%) |

The second table, shown below, presents the number of dc:language values that can be normalized by applying the string matching techniques implemented in this tool. For each string matching method, it lists the number of values the method can normalize and the confidence we assigned to the method. The different matching techniques address several types of language codes and names, but underlying all of them is a core vocabulary of languages. This vocabulary contains all ISO language codes and language names, in all European languages. It is the Languages Name Authority List (NAL) published in the European Union Open Data Portal - https://open-data.europa.eu/en/data/dataset/language

| Description of normalization | Confidence Level | Normalization to ISO-639-1 | Normalization to Languages NAL |
|---|---|---|---|
| Matching with a code from ISO-639-1 (currently in use at Europeana) | Almost certain | 23,988,536 | 23,988,536 |
| Match with any ISO code | Almost certain | 4,200,398 | 5,485,168 |
| Match with language name in any language (full field value) | Very high | 3,329,964 | 3,337,461 |
| Match by name in any language or ISO code (words within field value) | High | 768,733 | 771,472 |
| Total number of values that can be normalized | - | 32,287,631 | 33,582,637 |
| (percentage of total values in Europeana) | - | (82.23%) | (85.53%) |
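The four matching methods in the table above can be sketched as a cascade that tries the most reliable match first. This is an illustration under assumptions, not the tool's implementation: `LanguageMatcherSketch`, `Match`, `Confidence` and `normalize` are hypothetical names, the tiny hard-coded maps stand in for the Languages NAL, and only the first match in a value is returned. It assumes Java 9 or later for `Map.of`.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of the matching cascade; not the tool's actual classes. */
final class LanguageMatcherSketch {

    enum Confidence { ALMOST_CERTAIN, VERY_HIGH, HIGH }

    /** A normalized value (a Languages NAL code) plus the confidence of the method that found it. */
    static final class Match {
        final String nalCode;
        final Confidence confidence;

        Match(String nalCode, Confidence confidence) {
            this.nalCode = nalCode;
            this.confidence = confidence;
        }
    }

    // Tiny stand-in vocabularies (assumed sample data); the real tool relies on the Languages NAL.
    private static final Map<String, String> ISO_639_1 =
            Map.of("pt", "POR", "en", "ENG", "de", "DEU");
    private static final Map<String, String> OTHER_ISO =
            Map.of("por", "POR", "eng", "ENG", "deu", "DEU", "ger", "DEU");
    private static final Map<String, String> NAMES =
            Map.of("portuguese", "POR", "portugais", "POR", "english", "ENG",
                   "german", "DEU", "deutsch", "DEU", "allemand", "DEU");

    /** Looks a token up as an ISO code first, then as a language name; null if unknown. */
    private static String lookup(String token) {
        String code = ISO_639_1.get(token);
        if (code == null) code = OTHER_ISO.get(token);
        if (code == null) code = NAMES.get(token);
        return code;
    }

    static Optional<Match> normalize(String rawValue) {
        String value = rawValue.trim().toLowerCase(Locale.ROOT);
        // Methods 1 and 2: the whole field value is an ISO code.
        if (ISO_639_1.containsKey(value) || OTHER_ISO.containsKey(value)) {
            return Optional.of(new Match(lookup(value), Confidence.ALMOST_CERTAIN));
        }
        // Method 3: the whole field value is a language name in some language.
        if (NAMES.containsKey(value)) {
            return Optional.of(new Match(NAMES.get(value), Confidence.VERY_HIGH));
        }
        // Method 4: a code or name appears as one of the words inside the field value.
        for (String word : value.split("[^\\p{L}]+")) {
            String code = lookup(word);
            if (code != null) {
                return Optional.of(new Match(code, Confidence.HIGH));
            }
        }
        return Optional.empty();
    }
}
```

Ordering the checks from strictest to loosest means a value is always reported with the highest confidence it qualifies for. With these sample maps, `normalize("PT")` is matched as a whole-value ISO code, while `normalize("Text in English, 1850")` is only matched by the word-level step.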

In summary, by applying all matching methods and adopting the Languages NAL as the anchor vocabulary, the percentage of normalized dc:language values in the Europeana dataset may be improved from the current 61.09% (the existing values in ISO-639-1) to 85.53%.
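The two percentages follow directly from the counts reported in the tables. As a quick check, the arithmetic can be reproduced with the small sketch below; the class and method names are hypothetical and the numbers are copied from the tables above.

```java
import java.util.Locale;

/** Hypothetical helper that re-derives the coverage figures from the reported counts. */
final class CoverageSketch {
    static final long TOTAL_VALUES = 39_265_758L;      // all dc:language values
    static final long ISO_639_1_VALUES = 23_988_536L;  // values already in ISO-639-1

    /** Sum of the four per-method counts for normalization to the Languages NAL. */
    static long normalizedToNal() {
        return 23_988_536L + 5_485_168L + 3_337_461L + 771_472L;
    }

    static double percent(long part, long total) {
        return 100.0 * part / total;
    }

    public static void main(String[] args) {
        System.out.printf(Locale.ROOT, "current: %.2f%%, after normalization: %.2f%%%n",
                percent(ISO_639_1_VALUES, TOTAL_VALUES),
                percent(normalizedToNal(), TOTAL_VALUES));
    }
}
```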

Contributors

  • nfreire
