acoli-repo / olia

Ontologies of Linguistic Annotation. Machine-readable tagsets and annotation schemata for more than 100 languages.

License: Other

Java 26.57% Shell 0.33% CSS 0.46% HTML 70.18% JavaScript 1.26% XSLT 0.88% Python 0.25% Makefile 0.06%
linguistic-annotation natural-language-processing parts-of-speech-tagging dependency-syntax phrase-structure-tree parsing ontology interoperability

olia's Introduction

The Ontologies of Linguistic Annotation (OLiA) provide an OWL/DL taxonomy of data categories as a reference for linguistic annotation (OLiA Reference Model), plus OWL/DL models for a large number of annotation schemes (OLiA Annotation Models) and their relationship to reference data categories (OLiA Linking Models). The OLiA Reference Model itself is linked to community-maintained repositories such as GOLD (http://linguistics-ontology.org/) and ISOcat (http://www.isocat.org).

The OLiA ontologies were originally developed in the context of an infrastructure for the sustainable maintenance of linguistic resources, and they have been applied for the formalization of annotation schemes, concept-based querying over heterogeneously annotated corpora, the development of interoperable NLP pipelines, and as a central hub for annotation terminology in the Linguistic Linked Open Data (LLOD) cloud.

From 2005 to 2011, the OLiA ontologies were hosted at the University of Potsdam, Germany, from 2012 to 2019 at SourceForge, and since 2019 on GitHub. Please do not refer directly to any of these repositories or to any other mirror; instead, use (and consult) the links under http://purl.org/olia.

OLiA has been designed as an open source resource and has always been released as such. The SourceForge edition (until 2019) used CC-BY-SA 3.0. For up-to-date licensing information and the necessary attribution, please see http://purl.org/olia/Readme.md.

Brought to you by the Applied Computational Linguistics lab at Goethe University Frankfurt, Germany.

OLiA is open source (code: Apache v.2, data: CC-BY 3.0). For terms of use, attribution and detailed documentation, please see the official website.

olia's People

Contributors

cfaeth, chiarcos, giusepperizzo, kurzum, max-ionov, v-dimitrova

Forkers

azaddjanibd

olia's Issues

Formal versioning

With milestone 1.0, we will introduce numerical version numbers. Until then (that is, as long as we stay backward-compatible), we operate with 0.x, without a numerical value for x.

Align/extend with schemes for semantic annotation

At the moment, OLiA is focusing on corpora with morphological or syntactic annotation. Semantic annotation has been excluded (unless addressed in such corpora or dictionaries) as we assume that this requires an agreement among developers of semantic resources such as WordNet, PropBank and FrameNet. Such efforts are underway, but current results (e.g., SemLink v.2) are too unstable to incorporate them into OLiA.

Most likely, these specifications should move into a separate reference model and current specifications for semantics should move there, too. This will break backward compatibility, hence milestone 1.0.

Revise top-level structure

  • Rename top-level categories (MorphosyntacticCategory) to units (MorphosyntacticUnit)
  • Integrate OLiA Discourse Extensions
    [please document/discuss other requirements in this direction under this issue]

(breaks backward compatibility => milestone 1.0)

organize Annotation Models hierarchically

Right now, annotation models are a plain list. To more easily find relevant ones, they should have a shallow hierarchical structure:

  • /lang/ for language-specific schemes for morphology, morphosyntax or syntax, using BCP47 language codes
  • /multi/ for multi-/crosslingual/language-independent schemas for morphology, morphosyntax or syntax
  • /discourse/ for discourse
    etc.

Breaks backward compatibility => milestone 1.0

rdfs:labels

At the moment, OLiA uses CamelCase URIs, but rarely rdfs:labels. These are to be added.

Split Reference Model into levels of description

The OLiA Reference Model is a monolithic knowledge graph for annotation terminology. However, this makes it relatively large and hard to use. For a future OLiA release, we suggest breaking it into different levels of description (syntax, parts of speech, [morphosyntactic] features, morphology, semantics, discourse) with corresponding sub-URIs, e.g., http://purl.org/olia/pos#Noun, http://purl.org/olia/feats#Singular, http://purl.org/olia/syntax#Subject, http://purl.org/olia/morph#Diminuitive, etc. The corresponding parts of the current olia-top.owl should then be distributed across these levels as well.

This breaks backward compatibility, hence milestone 1.0.
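
A minimal sketch of how URIs could be rewritten under such a split. It assumes that the current Reference Model namespace is http://purl.org/olia/olia.owl# and that a concept-to-level table exists; the table below is hypothetical and lists only the four examples named above, with the names exactly as they appear there.

```python
# Assumed namespace of the current, monolithic Reference Model.
OLD_NS = "http://purl.org/olia/olia.owl#"
NEW_BASE = "http://purl.org/olia/"

# Hypothetical concept-to-level table, limited to the examples in this issue.
LEVEL_OF = {
    "Noun": "pos",
    "Singular": "feats",
    "Subject": "syntax",
    "Diminuitive": "morph",  # spelling as given in the issue text
}


def split_uri(old_uri):
    """Return the level-specific URI for a Reference Model concept.

    Returns None if the URI is outside the old namespace or if the
    concept has not been assigned to a level yet.
    """
    if not old_uri.startswith(OLD_NS):
        return None
    name = old_uri[len(OLD_NS):]
    level = LEVEL_OF.get(name)
    if level is None:
        return None
    return f"{NEW_BASE}{level}#{name}"
```

Returning None for unassigned concepts (rather than guessing a level) keeps the sketch honest about the fact that the actual assignment of concepts to levels is still to be decided.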

rename "annotation model" to "domain model" or "OLiA vocabulary"

Aside from linguistic annotations, OLiA now (2022) also provides models

  • for grammatical features in machine-readable dictionaries (e.g., LexInfo), and
  • for general linguistic terminology (BLL thesaurus)

The term "annotation model" is not really adequate for these applications.
The suggested change pertains to the documentation only, but it would diverge from the terminology used in all publications on OLiA so far. Its implementation should thus be aligned with milestone 1.0.

Add system:hasLemma and system:hasLemmaMatching for word-specific tags in annotation models

Extension to make sure that word-specific tags can be reproduced when mapping tags via OLiA.

Suggestion:

  • If hasLemma is defined in an annotation model, a particular instance is restricted to words with this exact lemma. hasLemma must be unique; if multiple lemmas are to be matched, use hasLemmaMatching.
  • If hasLemmaMatching is defined in an annotation model, a particular instance is restricted to words whose lemma matches this regular expression.

From a review of an LREC-2020 paper:

Some annotation schemes devise classes so that particular words
will always have the same tag, even though in particular sentences
they have different uses.  In LOB, for example (which I take as an
example because the manual is handy), the occurrences of "all" in
the two sentences "All mothers go there" and "Let all pray for
peace" are both tagged ABN.

Other annotation schemes may be devised which attempt to tag the
attributive and pronominal uses of "all" with different tags.

In the one scheme, the concepts of the tag set relate (at least in
these cases) to word forms like "all", and the intension of the
tag is, roughly "a word form which may sometimes be used as a
determiner and sometimes as a pronoun and ... (further
specification)".  In the other, the concepts of the tag set
relate, in the same cases, not to word forms but to word
occurrences, and the intension of the tags will be "a word token
used as determiner of a noun ..." and "a word token used
pronominally ...".
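
The suggested semantics of hasLemma and hasLemmaMatching can be sketched as follows. This is an illustration only: the dict representation of a tag definition and the function name are hypothetical, and it assumes that "matches" means a match against the whole lemma (re.fullmatch) rather than a substring match.

```python
import re


def tag_applies(definition, lemma):
    """Check whether a tag definition admits a word with the given lemma.

    `definition` is a plain dict standing in for an individual in an
    annotation model; its keys mirror the proposed property names.
    """
    exact = definition.get("hasLemma")
    if exact is not None:
        # hasLemma: restricted to words with exactly this lemma.
        return lemma == exact
    pattern = definition.get("hasLemmaMatching")
    if pattern is not None:
        # hasLemmaMatching: assumed to be matched against the full lemma.
        return re.fullmatch(pattern, lemma) is not None
    # No lemma restriction declared: the tag is not word-specific.
    return True


# The LOB example from the review, expressed with the proposed property.
abn = {"tag": "ABN", "hasLemma": "all"}

# A made-up tag restricted to several lemmas via a regular expression.
multi = {"tag": "X", "hasLemmaMatching": "all|both"}
```

With these definitions, tag_applies(abn, "all") holds while tag_applies(abn, "always") does not, and multi admits "both" but rejects "allow" because the whole lemma has to match.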

revise validation and publication workflow

  • create a central Makefile
  • automated XML, RDF and OWL validation (#5)
    • XML validation with xmllint (=> make release)
    • RDF validation with rapper (=> make release)
    • drop tools/validate_with_bash.sh
    • drop tools/validate_with_eyeball.sh and tools/eyeball (Eyeball is deprecated)
    • integrate OWL validation with Pellet (=> make validate)
  • automated validation of ontology metadata (#1)
    • check rdfs:labels (#4)
  • copy valid ontologies (from stable/, core/ and experimental/) to docs/owl
  • update Purl links to point to https://acoli-repo.github.io/olia/owl (cf. #10)
  • automated generation/refresh of docs/html
    • update/replace/automatize tools/olia2html
  • automated generation/update of docs/models.md

Purl redirects

Via purl.org, I created a partial redirect for http://purl.org/olia to replace earlier direct redirects for individual files. For this to work for all files in the stable/ folder, their individual entries must be removed. For many, this has already been done. However, the API is not reliable, and in many cases the DELETE operation was not applied despite reporting success.

Note: This is a severe bug, as the original URLs do not resolve anymore.

TODO:

  • Perform weekly checks at purl.org as to whether all direct entries for stable/ ontologies have been deleted.
  • If not, apply the deletion for the top 10 files.

include named entities into OLiA Discourse Extensions

  • request: mirror, link, and (optionally) revise the named entity categorization and linking from NERD
  • objective: NERD is a valuable resource but hasn't been maintained for 9 years; as of Sep 2022, the original website is down, but the code is still available via GitHub
  • suggestion: integrate into OLiA Discourse Extensions (rather than OLiA Reference Model) because it has a relationship with (co-)reference

unimorph:hasLabel

  • rename to unimorph:label (used in paper)
  • develop vocabulary to automatically map this to system:hasTagStartingWith, system:hasTagContaining, system:hasTagEndingWith (idea: an additional property that defines separator symbols => generate all combinations of separators and the label to enable partial matches)
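
The idea of combining separators and labels could look roughly like this. The function name and the returned dict layout are hypothetical; only the three system: property names come from the issue. A tag that consists of the label alone is not covered by any of the generated patterns and would need separate handling.

```python
def partial_match_patterns(label, separators):
    """Derive candidate patterns for one feature label.

    Returns a dict keyed by the system: property that each pattern
    would populate.
    """
    patterns = {
        "hasTagStartingWith": [],
        "hasTagContaining": [],
        "hasTagEndingWith": [],
    }
    for right in separators:
        # Label is the first feature: "PL;..."
        patterns["hasTagStartingWith"].append(label + right)
    for left in separators:
        # Label is the last feature: "...;PL"
        patterns["hasTagEndingWith"].append(left + label)
        for right in separators:
            # Label is in the middle: "...;PL;..."
            patterns["hasTagContaining"].append(left + label + right)
    return patterns
```

For the label "PL" and the separator ";", this yields "PL;", ";PL;" and ";PL", so a tag such as "N;PL" would be matched via hasTagEndingWith. Requiring separators on both sides avoids accidental matches where the label is merely a substring of another label.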

Remove deprecated URIs

Remove deprecated URIs (will break backward compatibility, do not apply to the current master branch).

consistency tests

To be developed and applied.
Must include:

  • XML validity (make release)
  • RDF validity (make release)
  • OWL-DL/2 validity (make release, more feedback with make validate)
  • URI resolution (make checks)
  • licence declaration
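
As a rough illustration of two of these checks, the following sketch tests XML well-formedness and looks for a licence declaration in an RDF/XML string. It is a stand-in using only the Python standard library, not a replacement for the make targets named above: it does not test RDF or OWL validity, and treating any element whose local name is "license" or "licence" (e.g., dct:license) as a licence declaration is an assumption.

```python
import xml.etree.ElementTree as ET


def basic_checks(rdfxml_text):
    """Run two lightweight checks on an RDF/XML string."""
    try:
        root = ET.fromstring(rdfxml_text)
    except ET.ParseError:
        # Not well-formed XML, so nothing else can be checked.
        return {"well_formed": False, "licence_declared": False}
    # ElementTree writes qualified names as "{namespace}local";
    # compare the local part only, ignoring case and spelling variant.
    licence_declared = any(
        el.tag.rsplit("}", 1)[-1].lower() in ("license", "licence")
        for el in root.iter()
    )
    return {"well_formed": True, "licence_declared": licence_declared}
```

A check like this could run before the heavier validators, so that files that are not even well-formed XML are rejected early.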

more discourse resources

OpenMinTeD interoperability criteria

Drop olia.nlp2rdf.org support

For integration in NLP2RDF/NIF workflows, there used to be a live mirror under http://olia.nlp2rdf.org. However, this seems to be defunct. As publication now involves a different process operating directly on GitHub, the following files should be removed or moved into a different sub-folder:

  • generate_csv_for_annotationmodel.sh
  • generate_csv_for_linkingmodel.sh
  • generate_java_map_from_csv.sh
  • OLiAMap.java
  • publish.sh

At the moment, these are being preserved because it is unclear whether any NIF or Stanbol workflows depend on them.
