acoli-repo / olia

Ontologies of Linguistic Annotation. Machine-readable tagsets and annotation schemata for more than 100 languages.

License: Other

Java 26.57% Shell 0.33% CSS 0.46% HTML 70.18% JavaScript 1.26% XSLT 0.88% Python 0.25% Makefile 0.06%

linguistic-annotation natural-language-processing parts-of-speech-tagging dependency-syntax phrase-structure-tree parsing ontology interoperability

olia's Introduction

The Ontologies of Linguistic Annotations (OLiA) provide an OWL/DL taxonomy of data categories as a reference for linguistic annotation (OLiA Reference Model), plus OWL/DL models for a large number of annotation schemes (OLiA Annotation Models) and their relationship to reference data categories (OLiA Linking Models). The OLiA Reference Model itself is linked to community-maintained repositories such as GOLD (http://linguistics-ontology.org/) and ISOcat (http://www.isocat.org).

The OLiA ontologies were originally developed in the context of an infrastructure for the sustainable maintenance of linguistic resources, and they have been applied for the formalization of annotation schemes, concept-based querying over heterogeneously annotated corpora, the development of interoperable NLP pipelines, and as a central hub for annotation terminology in the Linguistic Linked Open Data (LLOD) cloud.

From 2005 to 2011, the OLiA ontologies were hosted at the University of Potsdam, Germany, from 2012 to 2019 at SourceForge, and since 2019 on GitHub. Please do not refer directly to any of these repositories or to any other mirror; instead, use (and consult) the links under http://purl.org/olia.

OLiA has been designed as an open source resource and has always been released as such. The SourceForge edition (until 2019) used CC-BY-SA 3.0. For up-to-date licensing information and the necessary attribution, please see http://purl.org/olia/Readme.md.

Brought to you by the Applied Computational Linguistics lab at Goethe University Frankfurt, Germany.

OLiA is open source (code: Apache v.2, data: CC-BY 3.0). For terms of use, attribution, and detailed documentation, please see the official website.

olia's Issues

LOV best practices

cf. LOV recommendations

submit OLiA ontologies to LOV, integrate feedback

Formal versioning

With milestone 1.0, we will apply numerical version numbering. At the moment (= as long as we stay backward-compatible), we operate with 0.*x* (without a numerical value for x).

Align/extend with schemes for semantic annotation

At the moment, OLiA focuses on corpora with morphological or syntactic annotation. Semantic annotation has been excluded (unless addressed in such corpora or dictionaries) because we assume that it requires an agreement among developers of semantic resources such as WordNet, PropBank and FrameNet. Such efforts are underway, but current results (e.g., SemLink v.2) are too unstable to be incorporated into OLiA.

Most likely, these specifications should move into a separate reference model and current specifications for semantics should move there, too. This will break backward compatibility, hence milestone 1.0.

Revise top-level structure

Rename top-level categories (MorphosyntacticCategory) to units (MorphosyntacticUnit)
Integrate OLiA Discourse Extensions
[please document/discuss other requirements in this direction under this issue]

(breaks backward compatibility => milestone 1.0)

organize Annotation Models hierarchically

Right now, annotation models are a plain list. To more easily find relevant ones, they should have a shallow hierarchical structure:

/lang/ for language-specific schemes for morphology, morphosyntax or syntax, using BCP47 language codes
/multi/ for multi-/crosslingual/language-independent schemas for morphology, morphosyntax or syntax
/discourse/ for discourse
etc.

Breaks backward compatibility => milestone 1.0

rdfs:labels

At the moment, OLiA uses CamelCase URIs but rarely provides rdfs:labels. These are to be added.
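One way to bootstrap the missing labels is to derive a default label from the CamelCase local name of each URI. The following is only a minimal sketch: the function names are hypothetical, the example URI assumes the namespace http://purl.org/olia/olia.owl#, and generated labels would still need manual review.

```python
import re

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

def label_from_uri(uri: str) -> str:
    """Derive a readable label from the CamelCase local name of a URI."""
    # The local name is whatever follows the last '#' or '/'.
    local_name = re.split(r"[#/]", uri)[-1]
    # Insert a space at lower-to-upper boundaries ("cC" -> "c C") and
    # before the last capital of an acronym run ("XMLTag" -> "XML Tag").
    return re.sub(
        r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])", " ", local_name
    )

def label_triple(uri: str, lang: str = "en") -> str:
    """Format an rdfs:label statement as a single N-Triples line."""
    return f'<{uri}> <{RDFS_LABEL}> "{label_from_uri(uri)}"@{lang} .'

if __name__ == "__main__":
    # Assumed example URI; the namespace is not confirmed by this page.
    print(label_triple("http://purl.org/olia/olia.owl#MorphosyntacticCategory"))
```

The output is plain N-Triples, so it could be loaded next to an ontology without editing the OWL file itself.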

Split Reference Model into levels of description

The OLiA Reference Model is a monolithic knowledge graph for annotation terminology. This makes it relatively large and hard to use. For a future OLiA release, we suggest breaking it into different levels of description (syntax, parts of speech, [morphosyntactic] features, morphology, semantics, discourse) with corresponding sub-URIs, e.g., http://purl.org/olia/pos#Noun, http://purl.org/olia/feats#Singular, http://purl.org/olia/syntax#Subject, http://purl.org/olia/morph#Diminuitive, etc. The corresponding parts of the current olia-top.owl should then be distributed across these as well.

This breaks backward compatibility, hence milestone 1.0.
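To illustrate what such a split could mean for existing URIs, here is a small sketch that rewrites a Reference Model URI into a level-specific one. The old namespace (http://purl.org/olia/olia.owl#), the function name, and the concept-to-level table are assumptions for illustration only. The four level namespaces are taken from the examples above.

```python
from typing import Dict, Optional

# Assumed current namespace of the monolithic Reference Model.
OLD_NAMESPACE = "http://purl.org/olia/olia.owl#"

# Level namespaces as suggested in this issue.
LEVEL_NAMESPACES = {
    "pos": "http://purl.org/olia/pos#",
    "feats": "http://purl.org/olia/feats#",
    "syntax": "http://purl.org/olia/syntax#",
    "morph": "http://purl.org/olia/morph#",
}

def migrate_uri(old_uri: str, level_of: Dict[str, str]) -> Optional[str]:
    """Return the level-specific URI for old_uri, or None if it is unassigned."""
    if not old_uri.startswith(OLD_NAMESPACE):
        return None
    local_name = old_uri[len(OLD_NAMESPACE):]
    level = level_of.get(local_name)
    if level is None:
        return None  # concept not yet assigned to a level of description
    return LEVEL_NAMESPACES[level] + local_name

if __name__ == "__main__":
    # Hypothetical assignment of concepts to levels.
    assignment = {"Noun": "pos", "Singular": "feats", "Subject": "syntax"}
    print(migrate_uri(OLD_NAMESPACE + "Noun", assignment))
```

A table like this could also be used to generate deprecation or equivalence statements between old and new URIs.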

rename "annotation model" to "domain model" or "OLiA vocabulary"

Aside from linguistic annotations, OLiA now (2022) also provides models

for grammatical features in machine-readable dictionaries (e.g., LexInfo), and
for general linguistic terminology (BLL thesaurus)

The term "annotation model" is not really adequate for these applications.
The suggested change pertains to the documentation, only, but it sets it apart from all publications on OLiA so far. Its implementation should thus be aligned with milestone 1.0.

Add system:hasLemma and system:hasLemmaMatching for word-specific tags in annotation models

Extension to make sure that word-specific tags can be reproduced when mapping tags via OLiA.

Suggestion:

If hasLemma is defined in an annotation model, a particular instance is restricted to words with this exact lemma. hasLemma must be unique; if multiple lemmas are to be matched, use hasLemmaMatching.
If hasLemmaMatching is defined in an annotation model, a particular instance is restricted to words whose lemma matches this exact regular expression.
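The intended semantics of the two properties can be sketched as follows. This is a hedged illustration rather than an implementation of the OLiA system vocabulary: the TagInstance class and the lemma_allowed function are hypothetical, and treating "matches this exact regular expression" as a full match (rather than a substring search) is an assumption.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class TagInstance:
    """Hypothetical stand-in for an annotation model instance."""
    tag: str
    has_lemma: Optional[str] = None           # system:hasLemma (at most one)
    has_lemma_matching: Optional[str] = None  # system:hasLemmaMatching (regex)

def lemma_allowed(instance: TagInstance, lemma: str) -> bool:
    """Check whether a word with this lemma may carry the instance's tag."""
    if instance.has_lemma is not None and lemma != instance.has_lemma:
        return False
    if instance.has_lemma_matching is not None:
        # Assumption: the expression must match the whole lemma.
        if re.fullmatch(instance.has_lemma_matching, lemma) is None:
            return False
    return True  # no restriction defined, or all restrictions satisfied

if __name__ == "__main__":
    # "all" tagged ABN follows the LOB example quoted in this issue.
    abn = TagInstance(tag="ABN", has_lemma="all")
    print(lemma_allowed(abn, "all"), lemma_allowed(abn, "some"))
```

An instance without either property accepts any lemma, which keeps existing annotation models unaffected by the extension.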

From a review to an LREC-2020 paper:

Some annotation schemes devise classes so that particular words
will always have the same tag, even though in particular sentences
they have different uses.  In LOB, for example (which I take as an
example because the manual is handy), the occurrences of "all" in
the two sentences "All mothers go there" and "Let all pray for
peace" are both tagged ABN.

Other annotation schemes may be devised which attempt to tag the
attributive and pronominal uses of "all" with different tags.

In the one scheme, the concepts of the tag set relate (at least in
these cases) to word forms like "all", and the intension of the
tag is, roughly "a word form which may sometimes be used as a
determiner and sometimes as a pronoun and ... (further
specification)".  In the other, the concepts of the tag set
relate, in the same cases, not to word forms but to word
occurrences, and the intension of the tags will be "a word token
used as determiner of a noun ..." and "a word token used
pronominally ...".

revise validation and publication workflow

Purl redirects

Via purl.org, I created a partial redirect for http://purl.org/olia to replace earlier direct redirects for individual files. In order for this to work for all files in the stable/ folder, their individual entries must be removed. For many, this has been done already. However, the API is not reliable, and in many cases, the DELETE operation (despite reporting success) was not applied.

Note: This is a severe bug, as the original URLs do not resolve anymore.

TODO:

Perform weekly checks at purl.org as to whether all direct entries for stable/ ontologies have been deleted.
If not, apply the deletion for the top 10 files.
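The weekly check could be sketched as below. Everything specific here is an assumption: the function names, the idea of recognizing a working partial redirect by the prefix of the final URL, and the file names in the usage example. The sketch only detects stale entries; it does not call any purl.org API to delete them.

```python
import urllib.request
from typing import Callable, Iterable, List, Optional

def final_url(url: str, timeout: float = 10.0) -> Optional[str]:
    """Follow redirects and return the final URL, or None if resolution fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.geturl()
    except OSError:  # URLError and HTTPError are subclasses of OSError
        return None

def stale_entries(
    files: Iterable[str],
    expected_prefix: str,
    resolve: Callable[[str], Optional[str]] = final_url,
    base: str = "http://purl.org/olia/",
    limit: int = 10,
) -> List[str]:
    """List up to `limit` files whose PURL does not end up under expected_prefix."""
    stale = []
    for name in files:
        target = resolve(base + name)
        if target is None or not target.startswith(expected_prefix):
            stale.append(name)
    return stale[:limit]

if __name__ == "__main__":
    # Offline demonstration with a fake resolver and hypothetical targets.
    fake = {
        "http://purl.org/olia/olia.owl": "https://example.org/stable/olia.owl",
        "http://purl.org/olia/old.owl": "http://old.example.org/old.owl",
    }.get
    print(stale_entries(["olia.owl", "old.owl"], "https://example.org/stable/", resolve=fake))
```

Injecting the resolver keeps the check testable without network access; a scheduled job would call it with the default final_url and the actual file list of the stable/ folder.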

check LICENSE statement

at the moment, this seems to be in the ontologies, only.

include named entities into OLiA Discourse Extensions

request: mirror, link, and (optionally) revise the named entity categorization and linking from NERD
objective: NERD is a valuable resource but hasn't been maintained for 9 years; as of Sep 2022, the original website is down, but the code is still available via GitHub
suggestion: integrate into OLiA Discourse Extensions (rather than OLiA Reference Model) because it has a relationship with (co-)reference

unimorph:hasLabel

rename to unimorph:label (used in paper)
develop vocabulary to automatically map this to system:hasTagStartingWith, system:hasTagContaining, system:hasTagEndingWith (idea: additional property that defines separator symbols => generate all combinations of separators and the label to enable partial matches)
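The idea of combining separators and labels can be sketched as follows. The function name and the returned dictionary layout are hypothetical; only the three system: property names come from this issue. The example uses ";" as separator, as in UniMorph feature bundles such as N;PL.

```python
from itertools import product
from typing import Dict, List, Sequence

def tag_patterns(label: str, separators: Sequence[str]) -> Dict[str, List[str]]:
    """Generate partial-match strings for a label from all separator combinations."""
    return {
        # Label at the start of a tag: "PL;..."
        "system:hasTagStartingWith": [label + sep for sep in separators],
        # Label in the middle: any separator on either side, "...;PL;..."
        "system:hasTagContaining": [
            left + label + right for left, right in product(separators, repeat=2)
        ],
        # Label at the end of a tag: "...;PL"
        "system:hasTagEndingWith": [sep + label for sep in separators],
    }

if __name__ == "__main__":
    for prop, values in tag_patterns("PL", [";"]).items():
        print(prop, values)
```

A tag that consists of the label alone is not covered by these three patterns and would need an exact-match property in addition.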

Remove deprecated URIs

Remove deprecated URIs (will break backward compatibility, do not apply to the current master branch).

consistency tests

to be developed and applied.
must include

XML validity (make release)
RDF validity (make release)
OWL-DL/2 validity (make release, more feedback with make validate)
URI resolution (make checks)
licence declaration
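Two of these checks can be approximated with the Python standard library alone, as in the sketch below. It tests XML well-formedness (not validity against a schema) and looks for a license property anywhere in the document. The function name and the set of license properties are assumptions; which property the OLiA ontologies actually use is not stated here. RDF and OWL-DL validity need dedicated tooling and are not covered.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Assumed candidate properties for a license declaration (Clark notation).
LICENSE_PROPERTIES = {
    "{http://purl.org/dc/terms/}license",
    "{http://creativecommons.org/ns#}license",
}

def check_ontology(xml_text: str) -> Dict[str, bool]:
    """Run a well-formedness check and a license-declaration check on RDF/XML."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        # Not well-formed, so nothing else can be inspected.
        return {"xml_well_formed": False, "license_declared": False}
    has_license = any(element.tag in LICENSE_PROPERTIES for element in root.iter())
    return {"xml_well_formed": True, "license_declared": has_license}

if __name__ == "__main__":
    sample = (
        '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" '
        'xmlns:owl="http://www.w3.org/2002/07/owl#" '
        'xmlns:dct="http://purl.org/dc/terms/">'
        '<owl:Ontology rdf:about="http://example.org/demo">'
        '<dct:license rdf:resource="http://creativecommons.org/licenses/by/3.0/"/>'
        "</owl:Ontology></rdf:RDF>"
    )
    print(check_ontology(sample))
```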

Migrate to different URI schema

Purl URIs are extremely hard to maintain. Migrate to W3ID or another solution for persistent URIs.

reactivate OLiA Discourse

As part of the revised validation process, this ontology (and some others) failed validation.

more discourse resources

add discourse corpora

https://github.com/duncanka/BECauSE

OpenMinTeD interoperability criteria

OpenMinTeD interoperability criteria: A number of quality criteria, most of which can be easily improved

migrate OLiA website and server

move from Frankfurt to GitHub, update Purls

Drop olia.nlp2rdf.org support

For integration in NLP2RDF/NIF workflows, there used to be a live mirror under http://olia.nlp2rdf.org. However, this seems to be defunct. As publication now involves a different process operating directly on GitHub, the following files should be removed or moved into a different sub-folder:

generate_csv_for_annotationmodel.sh
generate_csv_for_linkingmodel.sh
generate_java_map_from_csv.sh
OLiAMap.java
publish.sh

At the moment, these are being preserved because it is unclear whether there are NIF or Stanbol workflows depending on that.

Archivo interoperability criteria

Make sure / provide validator for

The ontology is automatically retrievable and parses (make release)
License statement provided as separate document (make release)
The license statement achieves minimal interoperability, can be validated with Archivo SHACL tests, cf. https://archivo.dbpedia.org/rating
LODE conformity (w. SHACL)
successful consistency check by reasoner (make release)
submit ontologies to archivo
