WissenschaftsSTANDARD 🏹

A German dataset for few-shot relation extraction.

Data Source

The sentences in WissenschaftsSTANDARD are drawn from the 10kGNAD dataset. 10kGNAD is a topic classification dataset of journalistic articles from the Austrian newspaper "Der STANDARD". WissenschaftsSTANDARD only uses sentences from the "Wissenschaft" (science) topic category in 10kGNAD. As a result, the sentences mainly deal with scientific discoveries and events in academia.

Annotation Process

This dataset was created for the master's thesis "Few-Shot Relation Extraction for German" by Anna Sauer.

The sentences were annotated by a marvellous crowdsourcing team including Chris, Christl, Florian, Hans, Jean-Francois, Julia, Kristin, Marco, Marina, Marius, MG and Sandra. Thanks to you all! 💘

The minimalist annotation tool Locksley was used in the creation of WissenschaftsSTANDARD.

Relation Set

WissenschaftsSTANDARD contains the following eleven relations:

| id | German name | English name | description | number of instances |
|----|-------------|--------------|-------------|---------------------|
| 0 | andere | OTHER | | 1,496 |
| 1 | veröffentlicht in | publishes in | a person publishes a text in a scientific journal | 61 |
| 2 | Leiter:in | director | a person is at the head of an organisation | 76 |
| 3 | gehört an | affiliation | a person is affiliated with an organisation (workplace, membership etc.) | 519 |
| 4 | Kollegin, Kollege | colleague | two people work together | 208 |
| 5 | nachgeordnete Organisation | subsidiary | an organisation belongs to a larger parent organisation | 145 |
| 6 | Partnerorganisation | partner organisation | two organisations work together | 216 |
| 7 | Standort in | located in | an organisation is situated in a location | 260 |
| 8 | Teilort | geographical part of | a location is part of a larger geographical entity | 116 |
| 9 | verleiht Preis an | awards prize to | an organisation awards a prize to a person | 13 |
| 10 | Gründer:in | founder | a person has founded an organisation | 6 |

In the dataset file, the relations are all referred to by their id. In total, the dataset contains 3,116 labeled relation instances. Note that one sentence from 10kGNAD can be labeled with several relations and can therefore appear several times in WissenschaftsSTANDARD.
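The following is a minimal sketch of how repeated sentences could be grouped, assuming the file has already been loaded into a dictionary that maps relation ids (as strings) to lists of instances, each with a "tokens" list. The function name `relations_per_sentence` and the toy data are hypothetical and not part of the dataset.

```python
from collections import defaultdict


def relations_per_sentence(dataset):
    """Map each sentence (as a tuple of tokens) to the relation ids labeled on it."""
    by_sentence = defaultdict(list)
    for relation_id, instances in dataset.items():
        for instance in instances:
            by_sentence[tuple(instance["tokens"])].append(relation_id)
    return dict(by_sentence)


# Hypothetical toy data: the first sentence carries two relations
# (3 = affiliation, 7 = located in), so it appears twice.
sentence_a = ["Anna", "forscht", "am", "Institut", "in", "Wien", "."]
sentence_b = ["Das", "Labor", "liegt", "in", "Graz", "."]
toy = {
    "3": [{"tokens": sentence_a}],
    "7": [{"tokens": sentence_a}, {"tokens": sentence_b}],
}

grouped = relations_per_sentence(toy)
total_instances = sum(len(instances) for instances in toy.values())
distinct_sentences = len(grouped)
```

On real data, `total_instances` would be expected to be larger than `distinct_sentences` for exactly the reason described above.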

The relations are assumed to have a fixed direction from one entity to another. However, two relations, colleague and partner organisation, are symmetric and hold in both directions. For example, consider the sentence "Robin Hood and Friar Tuck are colleagues.". There is a colleague relation from Robin Hood to Friar Tuck, and also a colleague relation from Friar Tuck to Robin Hood. In cases like this, WissenschaftsSTANDARD contains a separate instance for each direction.
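As an illustration of what the two instances of a symmetric relation look like relative to each other, here is a small sketch that swaps head and tail. It assumes the instance layout with "tokens", "h" and "t" described under Dataset Format; the helper `mirror` and the example sentence are made up for this purpose and are not part of the dataset tooling.

```python
def mirror(instance):
    """Return a copy of the instance with head and tail entities swapped."""
    mirrored = dict(instance)
    mirrored["h"], mirrored["t"] = instance["t"], instance["h"]
    return mirrored


# Made-up colleague instance: Robin Hood -> Friar Tuck.
forward = {
    "tokens": ["Robin", "Hood", "und", "Friar", "Tuck", "sind", "Kollegen", "."],
    "h": ["robin hood", "", [[0, 1]]],
    "t": ["friar tuck", "", [[3, 4]]],
}

# The second direction: Friar Tuck -> Robin Hood.
backward = mirror(forward)
```

The tokens stay the same in both instances; only the roles of the two entities change.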

Dataset Format

The format of the JSON file is modeled after the data format of the FewRel benchmark for few-shot relation extraction. The file contains a dictionary whose keys are the relations in the dataset, referred to by their id. For each relation key, the corresponding value is a list of the labeled instances of that relation. This list contains a dictionary for each individual instance with the following keys:

  • "tokens": a list with the token string sequence in the sentence
  • "h": information on the head entity in a list with
    • a string with the entity mention in lower case
    • a string with the Wikidata id of the entity (cf. wikidata.org). In WissenschaftsSTANDARD, this string is left empty because the head and tail entities are not linked to their Wikidata equivalents.
    • a list with a nested list that contains the indices of the entity mention tokens in the sentence
  • "t": information on the tail entity in a list with the same structure as "h"
  • "ner": a list with information obtained from the task of named entity recognition (NER). The list contains the BIOES entity type tag for each token in the sentence. The BIOES tagging has been created using the German NER model in Stanza with the CoNLL 2003 tag set. This tag set contains the entity types PER (person), ORG (organisation) and LOC (location) (cf. https://stanfordnlp.github.io/stanza/available_models.html#available-ner-models).

Consider the following made-up example:

{
  "tokens": ["Robin", "Hood", "lebt", "in", "Sherwood", "Forest", "."],
  "h": ["sherwood forest", "", [[4, 5]]],
  "t": ["robin hood", "", [[0, 1]]],
  "ner": ["B-PER", "E-PER", "O", "O", "B-LOC", "E-LOC", "O"]
}
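The sketch below shows one way such a file could be read in Python. It is only an illustration under stated assumptions: the JSON string `raw` stands in for the dataset file, the instance in it is invented, and the helper names `mention_tokens` and `bioes_spans` are hypothetical rather than part of any released code.

```python
import json


def mention_tokens(instance, key):
    """Return the tokens of the head ("h") or tail ("t") entity mention."""
    _mention, _wikidata_id, index_lists = instance[key]
    return [instance["tokens"][i] for indices in index_lists for i in indices]


def bioes_spans(tags):
    """Collect (entity_type, start, end) spans from BIOES tags; end is inclusive."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        prefix, _, entity_type = tag.partition("-")
        if prefix == "S":                      # single-token entity
            spans.append((entity_type, i, i))
        elif prefix == "B":                    # entity begins
            start = i
        elif prefix == "E" and start is not None:  # entity ends
            spans.append((entity_type, start, i))
            start = None
        elif prefix == "O":                    # outside any entity
            start = None
    return spans


# Invented stand-in for the file contents: one instance of relation 3 (affiliation).
raw = """
{"3": [{
  "tokens": ["Marian", "arbeitet", "an", "der", "Universität", "Nottingham", "."],
  "h": ["marian", "", [[0]]],
  "t": ["universität nottingham", "", [[4, 5]]],
  "ner": ["S-PER", "O", "O", "O", "B-ORG", "E-ORG", "O"]
}]}
"""

dataset = json.loads(raw)
instance = dataset["3"][0]
head = mention_tokens(instance, "h")
tail = mention_tokens(instance, "t")
entities = bioes_spans(instance["ner"])
```

For an actual file, `json.loads(raw)` would be replaced by `json.load` on the opened dataset file; the rest stays the same as long as the layout matches the description above.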

License

10kGNAD is licensed under a Creative Commons BY-NC-SA 4.0 license (cf. https://github.com/tblock/10kGNAD). Therefore, WissenschaftsSTANDARD is also released under the Creative Commons BY-NC-SA 4.0 license.
