ELTE Novel Corpus

The ELTE Novel Corpus is a continuously expanding database developed by the Department of Digital Humanities at Eötvös Loránd University. Currently, the corpus contains 400 Hungarian novels. Besides the texts, the corpus contains annotations of structural units and of the grammatical features of words, all in TEI XML format. The novels date from the 19th century and the first half of the 20th century.

Numeric properties (level2):

The numbers below describe the level2 novels. The level1 collection is still being expanded.

  • number of novels: 400
  • number of authors: 119
  • number of tokens: 26.8 million
  • number of words: 21.4 million

Metadata of the novels:

The level1_metadata.tsv file contains the main metadata for level1 novels and the level2_metadata.tsv file contains the main metadata for level2 novels. WARNING: Since the level1 collection is still being expanded, the level1_metadata.tsv file may not be up to date, that is, some newly added novels may be missing from the TSV file.
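As a minimal sketch, the metadata files can be loaded with Python's standard `csv` module. The column names in the sample (`title`, `author`) are hypothetical and only serve the illustration; the real files may use different headers, so inspect the first row before relying on any name. The function name `read_metadata` is likewise an assumption.

```python
import csv
import io


def read_metadata(tsv_file):
    """Read a tab-separated metadata file into a list of dicts, one per novel."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return list(reader)


# Hypothetical in-memory sample; the real column names may differ.
sample = io.StringIO("title\tauthor\nExample Novel\tExample Author\n")
rows = read_metadata(sample)
print(len(rows), rows[0]["author"])
```

With a real file, the same function would be called on an open file object, for example `open("level2_metadata.tsv", encoding="utf-8", newline="")`, assuming the file is UTF-8 encoded.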

TEI Levels

The source of the corpus was the collection of the Hungarian Electronic Library.

  1. The texts from the Hungarian Electronic Library were converted into TEI XML format, following the Text Encoding Initiative guidelines. The TEI XML files contain the annotation of structural units and the metadata of the novels. The conversion was partly done manually (level1).
  2. Then, we tokenized the novels and annotated the grammatical features of words using e-magyar, an NLP toolchain for Hungarian texts (level2).

Elements and attributes

Level1 -- annotation of structural units and adding metadata to texts

  • <ns1:authorGender/> : sex of author
    • M : male
    • F : female
  • <ns1:size/> : size of the novel
    • short : 10 000 -- 49 999 words
    • medium : 50 000 -- 99 999 words
    • long : more than 100 000 words
  • <ns1:canonicity/> : canonicity level of the novel
    • low : 0 or 1 edition after 1979
    • high : 2 or more editions after 1979
  • <ns1:timeSlot/> : time period of the first edition of the novel
    • T0 : before 1840
    • T1 : 1840--1860
    • T2 : 1860--1880
    • T3 : 1880--1900
    • T4 : 1900--1920
    • T5 : after 1920
  • <head> : title
  • <div> : part, chapter
  • <milestone> : delimiter of subchapters
  • <p> : paragraph
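The size and canonicity categories above can be expressed as two small functions. This is only a sketch of the listed thresholds, not the procedure used to build the corpus: the function names are hypothetical, and the handling of exactly 100 000 words and of novels below 10 000 words is an assumption, since the list does not cover those cases.

```python
def size_category(word_count):
    """Map a word count to a size value (short / medium / long)."""
    if word_count < 10_000:
        return None  # below the shortest listed range
    if word_count < 50_000:
        return "short"
    if word_count < 100_000:
        return "medium"
    # Assumption: exactly 100 000 words is treated as "long";
    # the list only says "more than 100 000".
    return "long"


def canonicity(editions_after_1979):
    """Map the number of editions after 1979 to a canonicity value."""
    return "high" if editions_after_1979 >= 2 else "low"


print(size_category(62_000), canonicity(1))
```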

Level2 -- tokenization and annotation of grammatical features of words

  • <s> : sentence
  • <w> : word
  • <pc> : punctuation mark
  • @lemma : lemma
  • @pos : part of speech
  • @msd : morphosyntactic features (Universal Dependencies)
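To illustrate how the level2 annotation might be read, the sketch below walks the `<s>`, `<w>` and `<pc>` elements with Python's standard `xml.etree.ElementTree`. The sample fragment, its attribute values, the TEI namespace declaration and all function names are assumptions made for the example; the real files may be structured differently. Tags are matched by local name so the code works with or without a namespace. The `parse_msd` helper assumes the `Feature=Value|Feature=Value` notation of Universal Dependencies.

```python
import xml.etree.ElementTree as ET

# Hypothetical level2-style fragment; the attribute values are invented.
SAMPLE = """<text xmlns="http://www.tei-c.org/ns/1.0">
  <body><div><p>
    <s>
      <w lemma="ez" pos="PRON" msd="Case=Nom|Number=Sing">Ez</w>
      <w lemma="példa" pos="NOUN" msd="Case=Nom|Number=Sing">példa</w>
      <pc>.</pc>
    </s>
  </p></div></body>
</text>"""


def local_name(tag):
    """Strip a '{namespace}' prefix from an element tag, if present."""
    return tag.rsplit("}", 1)[-1]


def parse_msd(msd):
    """Split a 'Feature=Value|Feature=Value' string into a dict."""
    if not msd:
        return {}
    return dict(pair.split("=", 1) for pair in msd.split("|") if "=" in pair)


def iter_sentences(root):
    """Yield each sentence as a list of (form, lemma, pos, msd) tuples."""
    for elem in root.iter():
        if local_name(elem.tag) != "s":
            continue
        yield [
            (w.text, w.get("lemma"), w.get("pos"), w.get("msd"))
            for w in elem.iter()
            if local_name(w.tag) == "w"
        ]


def count_units(root):
    """Count <w> and <pc> elements in the tree."""
    counts = {"w": 0, "pc": 0}
    for elem in root.iter():
        name = local_name(elem.tag)
        if name in counts:
            counts[name] += 1
    return counts


root = ET.fromstring(SAMPLE)
for sentence in iter_sentences(root):
    for form, lemma, pos, msd in sentence:
        print(form, lemma, pos, parse_msd(msd))
print(count_units(root))
```

For a file on disk, `ET.parse(path).getroot()` would take the place of `ET.fromstring(SAMPLE)`.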

eltec folder:

The folder contains the level1 and level2 files with headers in the ELTeC format. These files are not valid TEI, so we do not recommend using them.

Contributors:

  • bajzattimi
  • horvathpeti99
  • dlazesz
  • gaborpalko

License

The content of the repository is licensed under the CC BY-NC-ND license.

All texts of the corpus are in the public domain.
