proiel-treebank's Introduction

PROIEL treebank utility library

Description

This is a utility library for reading and manipulating treebanks that use the PROIEL annotation scheme and the PROIEL XML-based interchange format.

Installation

This library requires Ruby >= 2.7 (as this is what Nokogiri 1.14.x requires).

Install with:

gem install proiel

Getting started

The recommended way to use this library in your application is with bundler. Create a Gemfile with the following content:

source 'https://rubygems.org'
gem 'proiel', '~> 1.0'

and then execute

bundle

To download a sample treebank, initialize a new git repository and add the PROIEL treebank as a submodule:

git init
mkdir vendor
git submodule add --depth 1 https://github.com/proiel/proiel-treebank.git vendor/proiel-treebank

Here is a skeleton programme to get you started. Save this as myproject.rb:

#!/usr/bin/env ruby
require 'proiel'

tb = PROIEL::Treebank.new
Dir[File.join('vendor', 'proiel-treebank', '*.xml')].each do |filename|
  puts "Reading #{filename}..."
  tb.load_from_xml(filename)
end

tb.sources.each do |source|
  source.divs.each do |div|
    div.sentences.each do |sentence|
      sentence.tokens.each do |token|
        # Do something
      end
    end
  end
end

You can now run this as:

bundle exec ruby myproject.rb

See the wiki for more information.
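As one possible way to fill in the "Do something" placeholder, the sketch below counts lemma frequencies. It assumes each token object responds to lemma, mirroring the lemma attribute in the XML; the DemoToken struct is a hypothetical stand-in so the example runs without loading a treebank.

```ruby
# Counts how often each lemma occurs among the given tokens. Each token is
# assumed to respond to `lemma`; tokens with a nil or empty lemma are skipped.
def lemma_frequencies(tokens)
  tokens.map(&:lemma).reject { |lemma| lemma.nil? || lemma.empty? }.tally
end

# Hypothetical stand-in for a token object, used only for this demonstration.
DemoToken = Struct.new(:lemma)

demo_tokens = [DemoToken.new('θεός'), DemoToken.new(''), DemoToken.new('θεός'), DemoToken.new('ὁράω')]
lemma_frequencies(demo_tokens).each do |lemma, count|
  puts "#{lemma}: #{count}"
end
```

In the skeleton above, the same method could be called as lemma_frequencies(sentence.tokens) inside the sentence loop, provided the token objects expose lemma as assumed.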

Versioning

proiel aims to adhere to Semantic Versioning 2.0.0. This means that a patch version or minor version should not break backward compatibility of a public API, and that breaking changes should only be introduced with new major versions. When specifying a dependency on this gem it is best practice to use a pessimistic version constraint with two digits of precision:

spec.add_dependency 'proiel', '~> 1.0'
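To illustrate what this constraint permits, here is a small sketch using RubyGems' own Gem::Requirement and Gem::Version classes. The helper name and the version numbers are made up for the example.

```ruby
require 'rubygems'

# Returns true if `version` satisfies `constraint`. With the default '~> 1.0'
# this means: at least 1.0 and below 2.0.
# (Hypothetical helper, not part of the proiel gem.)
def allowed_by_constraint?(version, constraint = '~> 1.0')
  Gem::Requirement.new(constraint).satisfied_by?(Gem::Version.new(version))
end

# Illustrative version numbers only.
%w[0.9.9 1.0.0 1.4.2 2.0.0].each do |version|
  puts "#{version}: #{allowed_by_constraint?(version) ? 'allowed' : 'rejected'}"
end
```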

Development

Check out the git repository from GitHub and run bin/setup to install all development dependencies. Then run rake to run the tests.

You can also run bin/console for an interactive prompt to experiment with.

To install a development version of this gem, run bundle exec rake install.

To release a new version, update the version number in version.rb, and then run bundle exec rake release, which will create a git tag for the version, push git commits and tags, and push the gem to rubygems.org.

Documentation

Documentation can be generated using YARD:

yard

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/proiel/proiel.

proiel-treebank's Issues

Invalid token alignments

A few tokens in latin-nt are aligned to non-existing tokens in greek-nt:

latin-nt.xml is invalid
* Token 492183: aligned to token greek-nt:491658 which does not exist
* Token 492184: aligned to token greek-nt:491659 which does not exist
* Token 492185: aligned to token greek-nt:491660 which does not exist
* Token 492186: aligned to token greek-nt:491661 which does not exist
* Token 492187: aligned to token greek-nt:491662 which does not exist
* Token 492188: aligned to token greek-nt:491663 which does not exist
* Token 492189: aligned to token greek-nt:491664 which does not exist
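A check of this kind can be sketched in plain Ruby. The sketch assumes the token ids and alignment ids have already been extracted into ordinary collections; the method name is hypothetical and this is not necessarily how the validator that produced the report works.

```ruby
require 'set'

# Returns the [token_id, alignment_id] pairs whose alignment target is not
# among the known token ids of the aligned source.
# alignments: Hash mapping token id => aligned token id
# target_ids: Enumerable of token ids that exist in the aligned source
def dangling_alignments(alignments, target_ids)
  known = target_ids.to_set
  alignments.reject { |_token_id, alignment_id| known.include?(alignment_id) }.to_a
end

# Ids taken from the excerpts on this page; a real run would use all tokens.
latin_alignments = { 250021 => 266690, 492183 => 491658 }
greek_token_ids = [266690, 491353]

dangling_alignments(latin_alignments, greek_token_ids).each do |token_id, alignment_id|
  puts "Token #{token_id}: aligned to token greek-nt:#{alignment_id} which does not exist"
end
```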

greek-nt.xml: Most of 3 John is missing, only part of 3 John 1:11 is there

Most of 3 John is missing; only part of 3 John 1:11 is present. Here is the entire content of that book as it appears in the file:

    <div id="312">
      <title>3 John 1</title>
      <sentence id="32875" status="reviewed" presentation-after=" ">
        <token id="491353" form="" citation-part="3JOHN 1.11" lemma="" part-of-speech="S-" morphology="-s---mn--i" head-id="491354" relation="aux" presentation-after=" "/>
        <token id="491354" form="κακοποιῶν" citation-part="3JOHN 1.11" lemma="ἀγαθοποιέω" part-of-speech="V-" morphology="-sppamn--i" head-id="491356" relation="sub" presentation-after=" "/>
        <token id="491355" form="οὐχ" citation-part="3JOHN 1.11" lemma="οὐ" part-of-speech="Df" morphology="---------n" head-id="491356" relation="aux" presentation-after=" "/>
        <token id="491356" form="ἑώρακεν" citation-part="3JOHN 1.11" lemma="ὁράω" part-of-speech="V-" morphology="3sria----i" relation="pred" presentation-after=" "/>
        <token id="491357" form="τὸν" citation-part="3JOHN 1.11" lemma="" part-of-speech="S-" morphology="-s---ma--i" head-id="491358" relation="aux" presentation-after=" "/>
        <token id="491358" form="θεόν" citation-part="3JOHN 1.11" lemma="θεός" part-of-speech="Nb" morphology="-s---ma--i" head-id="491356" relation="obj" presentation-after="."/>
      </sentence>
    </div>

Don't understand the implementation of alignment-id attribute values

After reading the PROIEL XML specification, I was under the impression that the alignment-id attribute holds an id value from the aligned file ("the alignment-id on div, sentence and token elements must be interpreted in relation to the alignment-id on the source element"). In the latin-nt.xml file I find information which suggests it is aligned with the greek-nt.xml:

<source id="latin-nt" language="lat" alignment-id="greek-nt">
(...)
<div alignment-id="40">
    <div alignment-id="1">
      <title>Matthew 1</title>
      <sentence id="12667" status="reviewed" presentation-after=" ">
        <token id="250021" form="liber" citation-part="MATT 1.1" lemma="liber" part-of-speech="Nb" morphology="-s---mn--i" head-id="851355" relation="xobj" presentation-after=" " alignment-id="266690">

In greek-nt.xml, however, I do not find divs with id "40" or "1". There is indeed a token with id "266690". The sentence element, however, has no alignment-id attribute. Is there a reason for the inconsistency at the level of divs and sentences? Are there plans to correct it?

We have a modern Croatian translation of the NT that we are aligning with PROIEL's Greek NT, starting at the sentence level. Eventually we would like to use that alignment to retrieve the Latin sentences as well, either directly or through the aligned Greek text. For that, it would be useful to have the Latin aligned at the sentence level too.

Invalid antecedent_id

Some tokens (still) reference antecedent tokens that do not exist:

armenian-nt.xml is invalid
* Token 737694: antecedent_id references an unknown token
* Token 814655: antecedent_id references an unknown token
* Token 814656: antecedent_id references an unknown token
* Token 1169890: antecedent_id references an unknown token
* Token 822366: antecedent_id references an unknown token
* Token 1079278: antecedent_id references an unknown token
* Token 1079279: antecedent_id references an unknown token
* Token 835116: antecedent_id references an unknown token
* Token 835120: antecedent_id references an unknown token
* Token 838521: antecedent_id references an unknown token
* Token 866518: antecedent_id references an unknown token
* Token 866522: antecedent_id references an unknown token
cic-att.xml is invalid
* Token 1171944: antecedent_id references an unknown token
marianus.xml is invalid
* Token 1161855: antecedent_id references an unknown token

These should either be filtered before producing the next release or corrected by reviewers.

Infer alignment IDs for sentences and divs

Currently only objects whose alignment IDs are set explicitly upstream (for whatever reason) have their alignment IDs set in PROIEL XML. This behaviour is not obvious to end users who may expect to find alignment indicated on all objects. Given that we can easily infer alignment of sentences and div elements from token alignments, we should consider adding them in post-processing or having upstream fill them in automatically.
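One way such inference might work is a majority vote: each sentence is aligned to the target sentence that contains most of its tokens' alignment targets. The sketch below is only an illustration under that assumption. The method name, the data layout, and the voting rule are all hypothetical, and the ids in the example are invented.

```ruby
# Infers sentence-level alignments from token-level alignments by majority vote.
# sentences:          Hash mapping sentence id => array of its token ids
# token_alignments:   Hash mapping token id => aligned token id in the target
# target_sentence_of: Hash mapping target token id => target sentence id
# Sentences without any resolvable token alignment are left out of the result.
def infer_sentence_alignments(sentences, token_alignments, target_sentence_of)
  sentences.each_with_object({}) do |(sentence_id, token_ids), result|
    votes = token_ids
            .map { |id| token_alignments[id] }
            .compact
            .map { |aligned_id| target_sentence_of[aligned_id] }
            .compact
            .tally
    next if votes.empty?

    result[sentence_id] = votes.max_by { |_target_sentence, count| count }.first
  end
end

# Invented ids, for illustration only.
sentences = { 1 => [10, 11, 12], 2 => [13] }
token_alignments = { 10 => 100, 11 => 101, 12 => 200 }
target_sentence_of = { 100 => 50, 101 => 50, 200 => 51 }

p infer_sentence_alignments(sentences, token_alignments, target_sentence_of)
```

Div alignments could be derived the same way from the inferred sentence alignments. How ties and one-to-many alignments should be handled is left open here.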

greek-nt.xml: Jude contains only Jude 1:22-Jude 1:23

In greek-nt.xml, Jude contains only Jude 1:22-Jude 1:23.

    <div id="313">
      <title>Jude 1</title>
      <sentence id="32923" status="reviewed">
        <token id="762576" form="καὶ" citation-part="JUDE 1.22" lemma="καί" part-of-speech="C-" morphology="---------n" head-id="762579" relation="aux" presentation-after=" "/>
        <token id="762577" form="οὓς" citation-part="JUDE 1.22" lemma="" part-of-speech="Pp" morphology="3p---ma--i" head-id="762579" relation="obj" presentation-after=" "/>
        <token id="762578" form="μὲν" citation-part="JUDE 1.22" lemma="μέν" part-of-speech="Df" morphology="---------n" head-id="762579" relation="aux" presentation-after=" "/>
        <token id="762579" form="ἐλέγχετε" citation-part="JUDE 1.22" lemma="ἐλέγχω" part-of-speech="V-" morphology="2ppma----i" relation="pred" presentation-after=" "/>
        <token id="762580" form="διακρινομένους" citation-part="JUDE 1.22" lemma="διακρίνω" part-of-speech="V-" morphology="-pppmma--i" head-id="762579" relation="xadv" presentation-after=", ">
          <slash target-id="762577" relation="xsub"/>
        </token>
      </sentence>
      <sentence id="57047" status="reviewed">
        <token id="762581" form="οὓς" citation-part="JUDE 1.23" lemma="" part-of-speech="Pp" morphology="3p---ma--i" head-id="762583" relation="obj" presentation-before=" " presentation-after=" "/>
        <token id="762582" form="δὲ" citation-part="JUDE 1.23" lemma="δέ" part-of-speech="Df" morphology="---------n" head-id="762583" relation="aux" presentation-after=" "/>
        <token id="762583" form="σῴζετε" citation-part="JUDE 1.23" lemma="σῴζω" part-of-speech="V-" morphology="2ppma----i" relation="pred" presentation-after=" "/>
        <token id="762584" form="ἐκ" citation-part="JUDE 1.23" lemma="ἐκ" part-of-speech="R-" morphology="---------n" head-id="762586" relation="obl" presentation-after=" "/>
        <token id="762585" form="πυρὸς" citation-part="JUDE 1.23" lemma="πῦρ" part-of-speech="Nb" morphology="-s---ng--i" head-id="762584" relation="obl" presentation-after=" "/>
        <token id="762586" form="ἁρπάζοντες" citation-part="JUDE 1.23" lemma="ἁρπάζω" part-of-speech="V-" morphology="-pppamn--i" head-id="762583" relation="xadv" presentation-after=", ">
          <slash target-id="762583" relation="xsub"/>
        </token>
      </sentence>
      <sentence id="57048" status="reviewed" presentation-after=" ">
        <token id="762587" form="οὓς" citation-part="JUDE 1.23" lemma="" part-of-speech="Pp" morphology="3p---ma--i" head-id="762589" relation="obj" presentation-after=" "/>
        <token id="762588" form="δὲ" citation-part="JUDE 1.23" lemma="δέ" part-of-speech="Df" morphology="---------n" head-id="762589" relation="aux" presentation-after=" "/>
        <token id="762589" form="ἐλεᾶτε" citation-part="JUDE 1.23" lemma="ἐλεέω" part-of-speech="V-" morphology="2ppma----i" relation="pred" presentation-after=" "/>
        <token id="762590" form="ἐν" citation-part="JUDE 1.23" lemma="ἐν" part-of-speech="R-" morphology="---------n" head-id="762589" relation="adv" presentation-after=" "/>
        <token id="762591" form="φόβῳ" citation-part="JUDE 1.23" lemma="φόβος" part-of-speech="Nb" morphology="-s---md--i" head-id="762590" relation="obl" presentation-after=", "/>
        <token id="762592" form="μισοῦντες" citation-part="JUDE 1.23" lemma="μισέω" part-of-speech="V-" morphology="-pppamn--i" head-id="762589" relation="xadv" presentation-after=" ">
          <slash target-id="762589" relation="xsub"/>
        </token>
        <token id="762593" form="καὶ" citation-part="JUDE 1.23" lemma="καί#1" part-of-speech="Df" morphology="---------n" head-id="762599" relation="aux" presentation-after=" "/>
        <token id="762594" form="τὸν" citation-part="JUDE 1.23" lemma="" part-of-speech="S-" morphology="-s---ma--i" head-id="762599" relation="aux" presentation-after=" "/>
        <token id="762595" form="ἀπὸ" citation-part="JUDE 1.23" lemma="ἀπό" part-of-speech="R-" morphology="---------n" head-id="762598" relation="ag" presentation-after=" "/>
        <token id="762596" form="τῆς" citation-part="JUDE 1.23" lemma="" part-of-speech="S-" morphology="-s---fg--i" head-id="762597" relation="aux" presentation-after=" "/>
        <token id="762597" form="σαρκὸς" citation-part="JUDE 1.23" lemma="σάρξ" part-of-speech="Nb" morphology="-s---fg--i" head-id="762595" relation="obl" presentation-after=" "/>
        <token id="762598" form="ἐσπιλωμένον" citation-part="JUDE 1.23" lemma="σπιλόω" part-of-speech="V-" morphology="-prppma--i" head-id="762599" relation="atr" presentation-after=" "/>
        <token id="762599" form="χιτῶνα" citation-part="JUDE 1.23" lemma="χιτών" part-of-speech="Nb" morphology="-s---ma--i" head-id="762592" relation="obj" presentation-after="."/>
      </sentence>
    </div>

Issues with automated lemmatization

I use Stanza, the Python library from the Stanford NLP group, to lemmatize Old Slavonic texts statistically using proiel-treebank-20180408. For the beginning of the Lord's Prayer, "Отче наш , Иже еси на небесе́х ! Да святится имя Твое , да прии́дет Царствие Твое ,", Stanza generates the following pairs of tokens and their lemmata:

[('Отче', 'отьць'), ('наш', 'насконити'), (',', ','), ('Иже', 'иже'), ('еси', 'бꙑти'), ('на', 'на'), ('небесе́х', 'небо'), ('!', '!'), ('Да', 'да'), ('святится', 'святится'), ('имя', 'имя'), ('Твое', 'свои'), (',', ','), ('да', 'да'), ('прии́дет', 'приити'), ('Царствие', 'иарствиѥ'), ('Твое', 'свои'), (',', 'бꙑти')]

I marked in bold three problematic lemmas that need correction. How should I do it?

greek-nt.xml: 1 John, 2 John seem to be missing

1 John and 2 John seem to be missing from greek-nt.xml.

It's easy to find part of 3 John with this grep:

$ proiel-treebank: grep 3JOHN greek-nt.xml
        <token id="491353" form="" citation-part="3JOHN 1.11" lemma="" part-of-speech="S-" morphology="-s---mn--i" head-id="491354" relation="aux" presentation-after=" "/>
        <token id="491354" form="κακοποιῶν" citation-part="3JOHN 1.11" lemma="ἀγαθοποιέω" part-of-speech="V-" morphology="-sppamn--i" head-id="491356" relation="sub" presentation-after=" "/>
        <token id="491355" form="οὐχ" citation-part="3JOHN 1.11" lemma="οὐ" part-of-speech="Df" morphology="---------n" head-id="491356" relation="aux" presentation-after=" "/>
        <token id="491356" form="ἑώρακεν" citation-part="3JOHN 1.11" lemma="ὁράω" part-of-speech="V-" morphology="3sria----i" relation="pred" presentation-after=" "/>
        <token id="491357" form="τὸν" citation-part="3JOHN 1.11" lemma="" part-of-speech="S-" morphology="-s---ma--i" head-id="491358" relation="aux" presentation-after=" "/>
        <token id="491358" form="θεόν" citation-part="3JOHN 1.11" lemma="θεός" part-of-speech="Nb" morphology="-s---ma--i" head-id="491356" relation="obj" presentation-after="."/>

I cannot find the other two epistles of John:

$ proiel-treebank: grep 2JOHN greek-nt.xml
$ proiel-treebank: grep 1JOHN greek-nt.xml
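The same check can be extended to list every book present in a file. The following is a rough sketch that scans the raw XML text with a regular expression instead of parsing it. It assumes double-quoted attributes and that the book abbreviation is the part of citation-part before the first space, as in the excerpts above. The method name is hypothetical.

```ruby
# Counts tokens per book by scanning citation-part attributes in raw XML text.
# The book is taken to be everything before the first space in the attribute.
def tokens_per_book(xml)
  xml.scan(/citation-part="([^" ]+)/).flatten.tally
end

sample = <<~XML
  <token id="491355" form="οὐχ" citation-part="3JOHN 1.11" lemma="οὐ"/>
  <token id="762576" form="καὶ" citation-part="JUDE 1.22" lemma="καί"/>
  <token id="762582" form="δὲ" citation-part="JUDE 1.23" lemma="δέ"/>
XML

tokens_per_book(sample).each do |book, count|
  puts "#{book}: #{count}"
end
```

For a whole file, something like tokens_per_book(File.read('greek-nt.xml')) should work. Books that do not appear in the result have no tokens in the file.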
