proiel / proiel-treebank

Official releases of the PROIEL treebank of ancient Indo-European languages

language linguistics latin corpus treebank ancient-languages ancient-greek gothic2 armenian old-church-slavonic

proiel-treebank's Introduction

As of April 2023, releases of the PROIEL Treebank have moved to https://github.com/syntacticus/syntacticus-treebank-data.

The PROIEL Treebank

The PROIEL Treebank is a dependency treebank with morphosyntactic and information-structure annotation. It includes texts in several ancient Indo-European languages and is freely available under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License.

Please cite as

Dag T. T. Haug and Marius L. Jøhndal. 2008. 'Creating a Parallel Treebank of the Old Indo-European Bible Translations'. In Caroline Sporleder and Kiril Ribarov (eds.). Proceedings of the Second Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2008) (2008), pp. 27-34.

Releases of the PROIEL Treebank are hosted on GitHub.

The following texts are included in this release of the treebank:

Text	Language	Filename	Size
The Greek New Testament (ed. Tischendorf 1869)	Ancient Greek	greek-nt	140,763 tokens
The Armenian New Testament (ed. Künzle 1984)	Classical Armenian	armenian-nt	23,513 tokens
The Gothic Bible (ed. Streitberg 1919)	Gothic	gothic-nt	57,211 tokens
Codex Marianus (ed. Jagić 1883)	Old Church Slavonic	marianus	58,269 tokens
Jerome's Vulgate	Latin	latin-nt	112,454 tokens
Caesar, Commentarii belli Gallici (ed. Holmes 1914)	Latin	caes-gal	28,607 tokens
Cicero, De officiis (ed. Miller 1913)	Latin	cic-off	10,644 tokens
Cicero, Epistulae ad Atticum (ed. Purser 1901)	Latin	cic-att	42,855 tokens
Palladius, Opus agriculturae (ed. Schmitt 1898)	Latin	pal-agr	12,148 tokens
Peregrinatio Aetheriae (ed. Heraeus 1908)	Latin	per-aeth	18,356 tokens
Herodotus, Histories (ed. Godley 1920)	Ancient Greek	hdt	85,080 tokens
Sphrantzes, Chronicles (post-1453) (ed. Grecu 1966)	Ancient Greek	chron	24,612 tokens

(The 'size' column in the table above shows the number of annotated tokens in a text. The number of tokens will be slightly larger than the number of words in the original printed edition as some words have been split into multiple tokens and some tokens have been inserted during annotation.)

Please see the XML files for detailed metadata and a full list of contributors.

Some sentences have not yet been annotated. This is an overview of where in the texts unannotated sentences occur:

Sections in which more than half of sentences have not yet been annotated:

armenian-nt: JOHN 1-21, MATT 1-28, MARK 1-16
caes-gal: 5.8-5.58, book 7, book 8
cic-att: 6.2-6.9, 7.2-7.9, 7.11-7.26, 8.1-8.16
cic-off: 1.114-1.161, book 2, book 3
greek-nt: HEB 13, 1PET 3-5, 2PET 1-3, 1JOHN 1-5, 2JOHN 1, 3JOHN 1, JUDE 1
hdt: 1.70, 1.127-1.130, 1.200, book 2, book 3, 4.1-4.156, 5.94-5.101, 6.82, 6.86, 7.1, 7.31, 8.8-8.144, book 9
latin-nt: COL 3-4, 1TIM 1-6, 2TIM 1-3, HEB 1-13, JAS 1-5, 1PET 1-5, 2PET 1-3, 1JOHN 1-5, 2JOHN 1, 3JOHN 1
pal-agr: 2.12, 3.13-3.34, books 4-14

Sections or section ranges in which there are gaps:

armenian-nt: LUKE 3
caes-gal: 6.36
cic-att: 1.17-1.20, 2.3-2.24, 3.20-3.23, 4.2-4.19, 5.2-5.21, 6.1, 7.1
cic-off: 1.7-1.10, 1.38, 1.48, 1.61, 1.100, 1.106, 1.112, 1.133
hdt: 1.45-1.69, 1.126, 1.141-1.216, 4.157-4.198, 5.1-5.109, 6.12-6.138, 7.2-7.198, 7.220-7.234, 8.3-8.7
latin-nt: ACTS 21-28, ROM 11, ROM 13, GAL 1-6, EPH 3-5, PHIL 1, PHIL 3, COL 1-2, 2THESS 3, 2TIM 4, JUDE 1
marianus: MATT 5, MARK 16, LUKE 2, LUKE 24, JOHN 1-2, JOHN 18, JOHN 20
pal-agr: 1.4-1.12, 1.35-1.40, 2.3, 2.9-2.23, 3.9-3.10

These gaps will be completed in future releases.

Data formats

The texts are available in two formats:

PROIEL XML: These files are the authoritative source files and the only ones that contain all available annotation. They contain the complete morphological, syntactic and information-structure annotation, as well as the complete text, including punctuation, section headers etc. The schema is defined in proiel.xsd.
CoNLL-X format

proiel-treebank's Issues

Infer alignment IDs for sentences and divs

Currently only objects whose alignment IDs are set explicitly upstream (for whatever reason) have their alignment IDs set in PROIEL XML. This behaviour is not obvious to end users who may expect to find alignment indicated on all objects. Given that we can easily infer alignment of sentences and div elements from token alignments, we should consider adding them in post-processing or having upstream fill them in automatically.
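
As a rough illustration of the post-processing idea, the sketch below infers a sentence-level alignment from token alignments by majority vote. It is not the maintainers' method. It assumes that PROIEL XML elements carry no XML namespace, that a token's alignment-id holds the id of a token in the aligned source, and that a simple majority vote is an acceptable heuristic. The function names are hypothetical.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def token_to_sentence(root):
    """Map each token id to the id of the sentence that contains it."""
    mapping = {}
    for sentence in root.iter("sentence"):
        for token in sentence.iter("token"):
            mapping[token.get("id")] = sentence.get("id")
    return mapping


def infer_sentence_alignments(aligned_root, target_root):
    """Hypothetical helper: for each sentence in aligned_root, pick the
    target sentence that most of its aligned tokens point into."""
    target_sentence_of = token_to_sentence(target_root)
    inferred = {}
    for sentence in aligned_root.iter("sentence"):
        votes = Counter()
        for token in sentence.iter("token"):
            target_token = token.get("alignment-id")  # None if not aligned
            if target_token in target_sentence_of:
                votes[target_sentence_of[target_token]] += 1
        if votes:
            # Sentences without any aligned token are left out entirely.
            inferred[sentence.get("id")] = votes.most_common(1)[0][0]
    return inferred
```

With real data the two roots would come from something like ET.parse("latin-nt.xml").getroot() and ET.parse("greek-nt.xml").getroot(). Sentences whose tokens point into several target sentences would need a more careful policy than a single winner.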

greek-nt.xml: 1 John, 2 John seem to be missing

1 John and 2 John seem to be missing from greek-nt.xml.

It's easy to find part of 3 John with this grep:

$ proiel-treebank: grep 3JOHN greek-nt.xml
        <token id="491353" form="ὁ" citation-part="3JOHN 1.11" lemma="ὁ" part-of-speech="S-" morphology="-s---mn--i" head-id="491354" relation="aux" presentation-after=" "/>
        <token id="491354" form="κακοποιῶν" citation-part="3JOHN 1.11" lemma="ἀγαθοποιέω" part-of-speech="V-" morphology="-sppamn--i" head-id="491356" relation="sub" presentation-after=" "/>
        <token id="491355" form="οὐχ" citation-part="3JOHN 1.11" lemma="οὐ" part-of-speech="Df" morphology="---------n" head-id="491356" relation="aux" presentation-after=" "/>
        <token id="491356" form="ἑώρακεν" citation-part="3JOHN 1.11" lemma="ὁράω" part-of-speech="V-" morphology="3sria----i" relation="pred" presentation-after=" "/>
        <token id="491357" form="τὸν" citation-part="3JOHN 1.11" lemma="ὁ" part-of-speech="S-" morphology="-s---ma--i" head-id="491358" relation="aux" presentation-after=" "/>
        <token id="491358" form="θεόν" citation-part="3JOHN 1.11" lemma="θεός" part-of-speech="Nb" morphology="-s---ma--i" head-id="491356" relation="obj" presentation-after="."/>

I cannot find the other two epistles of John:

$ proiel-treebank: grep 2JOHN greek-nt.xml

$ proiel-treebank: grep 1JOHN greek-nt.xml
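
The same check can be done for every book at once. The following is a minimal sketch, assuming that the book abbreviation is the part of citation-part before the first space (as in "3JOHN 1.11") and that the XML uses no namespace. The function names and the list of expected books are hypothetical.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def tokens_per_book(root):
    """Count tokens per book, using the prefix of citation-part."""
    counts = Counter()
    for token in root.iter("token"):
        citation = token.get("citation-part")
        if citation:  # tokens without a citation are skipped
            counts[citation.split(" ", 1)[0]] += 1
    return counts


def missing_books(root, expected):
    """Return the expected book abbreviations that have no tokens at all."""
    counts = tokens_per_book(root)
    return [book for book in expected if counts[book] == 0]
```

For example, missing_books(ET.parse("greek-nt.xml").getroot(), ["1JOHN", "2JOHN", "3JOHN"]) would list the abbreviations for which grep finds nothing.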

greek-nt.xml: 2 Peter is missing.

2 Peter seems to be missing in greek-nt.xml.

latin-nt.xml : MATT 17.15

https://www.perseus.tufts.edu/hopper/text?doc=Matthew+17.15&fromdoc=Perseus%3Atext%3A1999.02.0060

Invalid token alignments

A few tokens in latin-nt are aligned to non-existing tokens in greek-nt:

latin-nt.xml is invalid
* Token 492183: aligned to token greek-nt:491658 which does not exist
* Token 492184: aligned to token greek-nt:491659 which does not exist
* Token 492185: aligned to token greek-nt:491660 which does not exist
* Token 492186: aligned to token greek-nt:491661 which does not exist
* Token 492187: aligned to token greek-nt:491662 which does not exist
* Token 492188: aligned to token greek-nt:491663 which does not exist
* Token 492189: aligned to token greek-nt:491664 which does not exist
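
A check of this kind can be reproduced outside the validator. The sketch below is not the validator that produced the report above. It assumes that a token's alignment-id names a token id in the aligned source and that the files use no XML namespace. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def dangling_alignments(aligned_root, target_root):
    """Return (token id, alignment id) pairs whose alignment id does not
    match any token id in the target source."""
    target_ids = {token.get("id") for token in target_root.iter("token")}
    dangling = []
    for token in aligned_root.iter("token"):
        aligned_to = token.get("alignment-id")
        if aligned_to is not None and aligned_to not in target_ids:
            dangling.append((token.get("id"), aligned_to))
    return dangling
```

Called with the roots of latin-nt.xml and greek-nt.xml, it would return one pair per token of the kind listed above.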

Vulgate and Punctuation

In the XML file, one can read that punctuation was added from the Clementine Text Project, but I cannot see any: is the treebank file with punctuation available somewhere?

greek-nt.xml: Jude contains only Jude 1:22-Jude 1:23

In greek-nt.xml, Jude contains only Jude 1:22-Jude 1:23.

    <div id="313">
      <title>Jude 1</title>
      <sentence id="32923" status="reviewed">
        <token id="762576" form="καὶ" citation-part="JUDE 1.22" lemma="καί" part-of-speech="C-" morphology="---------n" head-id="762579" relation="aux" presentation-after=" "/>
        <token id="762577" form="οὓς" citation-part="JUDE 1.22" lemma="ὁ" part-of-speech="Pp" morphology="3p---ma--i" head-id="762579" relation="obj" presentation-after=" "/>
        <token id="762578" form="μὲν" citation-part="JUDE 1.22" lemma="μέν" part-of-speech="Df" morphology="---------n" head-id="762579" relation="aux" presentation-after=" "/>
        <token id="762579" form="ἐλέγχετε" citation-part="JUDE 1.22" lemma="ἐλέγχω" part-of-speech="V-" morphology="2ppma----i" relation="pred" presentation-after=" "/>
        <token id="762580" form="διακρινομένους" citation-part="JUDE 1.22" lemma="διακρίνω" part-of-speech="V-" morphology="-pppmma--i" head-id="762579" relation="xadv" presentation-after=", ">
          <slash target-id="762577" relation="xsub"/>
        </token>
      </sentence>
      <sentence id="57047" status="reviewed">
        <token id="762581" form="οὓς" citation-part="JUDE 1.23" lemma="ὁ" part-of-speech="Pp" morphology="3p---ma--i" head-id="762583" relation="obj" presentation-before=" " presentation-after=" "/>
        <token id="762582" form="δὲ" citation-part="JUDE 1.23" lemma="δέ" part-of-speech="Df" morphology="---------n" head-id="762583" relation="aux" presentation-after=" "/>
        <token id="762583" form="σῴζετε" citation-part="JUDE 1.23" lemma="σῴζω" part-of-speech="V-" morphology="2ppma----i" relation="pred" presentation-after=" "/>
        <token id="762584" form="ἐκ" citation-part="JUDE 1.23" lemma="ἐκ" part-of-speech="R-" morphology="---------n" head-id="762586" relation="obl" presentation-after=" "/>
        <token id="762585" form="πυρὸς" citation-part="JUDE 1.23" lemma="πῦρ" part-of-speech="Nb" morphology="-s---ng--i" head-id="762584" relation="obl" presentation-after=" "/>
        <token id="762586" form="ἁρπάζοντες" citation-part="JUDE 1.23" lemma="ἁρπάζω" part-of-speech="V-" morphology="-pppamn--i" head-id="762583" relation="xadv" presentation-after=", ">
          <slash target-id="762583" relation="xsub"/>
        </token>
      </sentence>
      <sentence id="57048" status="reviewed" presentation-after=" ">
        <token id="762587" form="οὓς" citation-part="JUDE 1.23" lemma="ὁ" part-of-speech="Pp" morphology="3p---ma--i" head-id="762589" relation="obj" presentation-after=" "/>
        <token id="762588" form="δὲ" citation-part="JUDE 1.23" lemma="δέ" part-of-speech="Df" morphology="---------n" head-id="762589" relation="aux" presentation-after=" "/>
        <token id="762589" form="ἐλεᾶτε" citation-part="JUDE 1.23" lemma="ἐλεέω" part-of-speech="V-" morphology="2ppma----i" relation="pred" presentation-after=" "/>
        <token id="762590" form="ἐν" citation-part="JUDE 1.23" lemma="ἐν" part-of-speech="R-" morphology="---------n" head-id="762589" relation="adv" presentation-after=" "/>
        <token id="762591" form="φόβῳ" citation-part="JUDE 1.23" lemma="φόβος" part-of-speech="Nb" morphology="-s---md--i" head-id="762590" relation="obl" presentation-after=", "/>
        <token id="762592" form="μισοῦντες" citation-part="JUDE 1.23" lemma="μισέω" part-of-speech="V-" morphology="-pppamn--i" head-id="762589" relation="xadv" presentation-after=" ">
          <slash target-id="762589" relation="xsub"/>
        </token>
        <token id="762593" form="καὶ" citation-part="JUDE 1.23" lemma="καί#1" part-of-speech="Df" morphology="---------n" head-id="762599" relation="aux" presentation-after=" "/>
        <token id="762594" form="τὸν" citation-part="JUDE 1.23" lemma="ὁ" part-of-speech="S-" morphology="-s---ma--i" head-id="762599" relation="aux" presentation-after=" "/>
        <token id="762595" form="ἀπὸ" citation-part="JUDE 1.23" lemma="ἀπό" part-of-speech="R-" morphology="---------n" head-id="762598" relation="ag" presentation-after=" "/>
        <token id="762596" form="τῆς" citation-part="JUDE 1.23" lemma="ὁ" part-of-speech="S-" morphology="-s---fg--i" head-id="762597" relation="aux" presentation-after=" "/>
        <token id="762597" form="σαρκὸς" citation-part="JUDE 1.23" lemma="σάρξ" part-of-speech="Nb" morphology="-s---fg--i" head-id="762595" relation="obl" presentation-after=" "/>
        <token id="762598" form="ἐσπιλωμένον" citation-part="JUDE 1.23" lemma="σπιλόω" part-of-speech="V-" morphology="-prppma--i" head-id="762599" relation="atr" presentation-after=" "/>
        <token id="762599" form="χιτῶνα" citation-part="JUDE 1.23" lemma="χιτών" part-of-speech="Nb" morphology="-s---ma--i" head-id="762592" relation="obj" presentation-after="."/>
      </sentence>
    </div>

Include detailed break-down of annotated and unannotated sections in texts

We should include a break-down of which parts of texts have been annotated and which remain to be annotated. It's probably a good idea to keep a (shorter?) version of this on the treebank webpage too, and make it easy to find from Syntacticus.
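
One possible way to generate such a break-down is sketched below. It assumes that each div has a title child and sentence children, that the sentence status attribute distinguishes unannotated sentences with the value "unannotated" (only "reviewed" appears in the excerpts on this page), and that the XML uses no namespace. The function names are hypothetical.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def status_breakdown(root):
    """Map each div title to a Counter of its sentences' status values."""
    breakdown = {}
    for div in root.iter("div"):
        sentences = div.findall("sentence")  # direct children only
        if not sentences:
            continue
        title = div.findtext("title", default="(untitled)")
        statuses = (s.get("status", "unknown") for s in sentences)
        breakdown.setdefault(title, Counter()).update(statuses)
    return breakdown


def mostly_unannotated(breakdown, threshold=0.5):
    """Titles where more than `threshold` of sentences are unannotated."""
    return [
        title
        for title, counts in breakdown.items()
        if counts["unannotated"] / sum(counts.values()) > threshold
    ]
```

The remaining titles with at least one unannotated sentence would correspond to the sections "in which there are gaps".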

Don't understand the implementation of alignment-id attribute values

After reading the PROIEL XML specification, I was under the impression that the alignment-id attribute holds an id value from the aligned file ("the alignment-id on div, sentence and token elements must be interpreted in relation to the alignment-id on the source element"). In the latin-nt.xml file I find information which suggests it is aligned with the greek-nt.xml:

<source id="latin-nt" language="lat" alignment-id="greek-nt">
(...)
<div alignment-id="40">
    <div alignment-id="1">
      <title>Matthew 1</title>
      <sentence id="12667" status="reviewed" presentation-after=" ">
        <token id="250021" form="liber" citation-part="MATT 1.1" lemma="liber" part-of-speech="Nb" morphology="-s---mn--i" head-id="851355" relation="xobj" presentation-after=" " alignment-id="266690">

But, in the greek-nt.xml I do not find divs with id "40" or "1". True, there is a token with id "266690". The sentence element, however, is missing its alignment-id attribute. Is there a reason for the inconsistency on the level of divs and sentences? Are there plans to correct it?

We have a (modern Croatian) translation of the NT which we're aligning with the PROIEL's Greek NT. We're starting with the level of sentences. Eventually, we would like to be able to use that alignment to retrieve the Latin sentences as well -- either directly, or through the Greek aligned text. For that, it would be useful to have the Latin aligned at the sentence level as well.

Issues about automatized lemmatizing

I use the Stanza Python library from the Stanford NLP group to statistically lemmatize Old Slavonic texts using proiel-treebank-20180408. For the beginning of the Lord's Prayer, "Отче наш , Иже еси на небесе́х ! Да святится имя Твое , да прии́дет Царствие Твое ,", Stanza generates the following pairs of tokens and their lemmata:

[('Отче', 'отьць'), ('наш', 'насконити'), (',', ','), ('Иже', 'иже'), ('еси', 'бꙑти'), ('на', 'на'), ('небесе́х', 'небо'), ('!', '!'), ('Да', 'да'), ('святится', 'святится'), ('имя', 'имя'), ('Твое', 'свои'), (',', ','), ('да', 'да'), ('прии́дет', 'приити'), ('Царствие', 'иарствиѥ'), ('Твое', 'свои'), (',', 'бꙑти')]

I marked in bold three problematic lemmas that need correction. How should I do it?

Invalid antecedent_id

Some tokens (still) reference antecedent tokens that do not exist:

armenian-nt.xml is invalid
* Token 737694: antecedent_id references an unknown token
* Token 814655: antecedent_id references an unknown token
* Token 814656: antecedent_id references an unknown token
* Token 1169890: antecedent_id references an unknown token
* Token 822366: antecedent_id references an unknown token
* Token 1079278: antecedent_id references an unknown token
* Token 1079279: antecedent_id references an unknown token
* Token 835116: antecedent_id references an unknown token
* Token 835120: antecedent_id references an unknown token
* Token 838521: antecedent_id references an unknown token
* Token 866518: antecedent_id references an unknown token
* Token 866522: antecedent_id references an unknown token
cic-att.xml is invalid
* Token 1171944: antecedent_id references an unknown token
marianus.xml is invalid
* Token 1161855: antecedent_id references an unknown token

These should either be filtered before producing the next release or corrected by reviewers.
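
The filtering option could look roughly like the sketch below. It assumes that the XML attribute is spelled antecedent-id (the report above prints it as antecedent_id), that antecedents always live in the same file, and that the XML uses no namespace. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def dangling_antecedents(root, attribute="antecedent-id"):
    """Return ids of tokens whose antecedent is not a token in this file."""
    token_ids = {token.get("id") for token in root.iter("token")}
    return [
        token.get("id")
        for token in root.iter("token")
        if token.get(attribute) is not None
        and token.get(attribute) not in token_ids
    ]


def strip_dangling_antecedents(root, attribute="antecedent-id"):
    """Remove dangling antecedent references in place; return their count."""
    bad = set(dangling_antecedents(root, attribute))
    for token in root.iter("token"):
        if token.get("id") in bad:
            del token.attrib[attribute]
    return len(bad)
```

Stripping the attribute discards the annotator's intent, so reporting the ids to reviewers first is probably the safer route.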

greek-nt.xml: 1 Peter is truncated after 1 Peter 3:4

In greek-nt.xml, 1 Peter ends with 1 Peter 3:4. The last two chapters are missing, and so is everything after verse 4 in the third chapter.

greek-nt.xml: Most of 3 John is missing, only part of 3 John 1:11 is there

Most of 3 John is missing; only part of 3 John 1:11 is there. Here are the entire contents of that book:

    <div id="312">
      <title>3 John 1</title>
      <sentence id="32875" status="reviewed" presentation-after=" ">
        <token id="491353" form="ὁ" citation-part="3JOHN 1.11" lemma="ὁ" part-of-speech="S-" morphology="-s---mn--i" head-id="491354" relation="aux" presentation-after=" "/>
        <token id="491354" form="κακοποιῶν" citation-part="3JOHN 1.11" lemma="ἀγαθοποιέω" part-of-speech="V-" morphology="-sppamn--i" head-id="491356" relation="sub" presentation-after=" "/>
        <token id="491355" form="οὐχ" citation-part="3JOHN 1.11" lemma="οὐ" part-of-speech="Df" morphology="---------n" head-id="491356" relation="aux" presentation-after=" "/>
        <token id="491356" form="ἑώρακεν" citation-part="3JOHN 1.11" lemma="ὁράω" part-of-speech="V-" morphology="3sria----i" relation="pred" presentation-after=" "/>
        <token id="491357" form="τὸν" citation-part="3JOHN 1.11" lemma="ὁ" part-of-speech="S-" morphology="-s---ma--i" head-id="491358" relation="aux" presentation-after=" "/>
        <token id="491358" form="θεόν" citation-part="3JOHN 1.11" lemma="θεός" part-of-speech="Nb" morphology="-s---ma--i" head-id="491356" relation="obj" presentation-after="."/>
      </sentence>
    </div>