nkprasad12 / morcus-net
Source code for Morcus Latin Tools.
Home Page: https://morcus.net
License: GNU General Public License v3.0
(i.e., right now we would run into issues if we wanted to create another module and import it)
The data we want to store / send would be:
We could use protobuf for this: https://www.npmjs.com/package/protobufjs
One possible format might be:
message TokenAnalysis {
  // Lemma form for this analysis. May need a more structured type for principal parts.
  string lemma = 1;
  // POS tags for this lemma. This may also be a proto or a hand-rolled efficient format.
  // Order matters, ranked by preference.
  repeated string tag = 2;
  // Indices of the macrons relative to the start of the token (not the text).
  repeated int32 macron_indices = 3;
}

message TextData {
  // The index of the first character of this token in the original text.
  int32 start_index = 1;
  // The length of this token.
  int32 token_length = 2;
  // Possible analyses for this token. Order matters, ranked by preference.
  repeated TokenAnalysis analyses = 3;
}
We could also try a hand-rolled string version of this format, but we'd need to keep that in sync between TypeScript and Python, since it needs to be written by the NLP pieces and read by the client and potentially the browser.
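As a rough sketch of what writing and reading this could look like on the Python side, assuming the schema above is saved as text_data.proto and compiled with protoc --python_out (the module name text_data_pb2 and the tag string below are illustrative, not existing code):

import text_data_pb2

token = text_data_pb2.TextData(start_index=0, token_length=6)
analysis = token.analyses.add()
analysis.lemma = "excido"
analysis.tag.append("v1spia")  # hypothetical encoding of a POS tag
analysis.macron_indices.append(3)
serialized = token.SerializeToString()  # compact bytes to store or send
restored = text_data_pb2.TextData.FromString(serialized)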
Does CLTK have algorithms to handle this?
Regarding quisque (and friends):
We looked at it a bit today in class. Tom agreed that his sense was that it was rarer than the word quisque, but when you get into the oblique cases and certain forms, there are definitely plenty of attestations. For example, we searched "quorumque" on PHI as one of the most obviously fringe cases and, scrolling through, saw mostly "et quorum"; but we searched "quibusque" and saw a fair mix of each in the brief time that we looked.
It was a bit frustrating to look through, because in these cases it was very easy for a human to quickly see which reading it was, but it seems like it would be hard to train a machine to do the same. I am sure it is still trainable, though, since there are contextual factors a human is picking up on that you could teach a bot too.
Currently we're not doing anything with it:
morcus-net/src/py/utils/pipeline.py, line 109 (at commit c427f03)
We can use the lemma and features of each element of the analyze output, i.e.:
doc = self._nlp.analyze(text_part.text)
doc[0].lemma
doc[0].features
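For reference, a minimal end-to-end CLTK invocation along these lines would be (a sketch based on the quickstart linked below; attribute names may vary across CLTK versions):

from cltk import NLP

nlp = NLP(language="lat")  # downloads models on first use
doc = nlp.analyze(text="Gallia est omnis divisa in partes tres")
print(doc.words[0].lemma, doc.words[0].features)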
Full details: https://docs.cltk.org/en/latest/quickstart.html
Minimal example of usage:
from macronizer import Macronizer

macronizer = Macronizer()
macronizedtext = macronizer.macronize(
    "Iam primum omnium satis constat Troia capta in ceteros saevitum esse Troianos")
print(macronizedtext)
Initializing Macronizer() may take a couple of seconds, so if you want to mark macrons in several strings, you are better off reusing the same Macronizer object. We can invoke this from the command line, but it might take some work to keep the same Macronizer instance alive across all the calls.
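One way to do that (just a sketch, not something the repo does today) is a long-lived worker that holds a single Macronizer and handles one request per line on stdin:

import sys
from macronizer import Macronizer

def main():
    macronizer = Macronizer()  # pay the initialization cost once
    for line in sys.stdin:
        # Write the macronized form of each input line back out.
        print(macronizer.macronize(line.strip()), flush=True)

if __name__ == "__main__":
    main()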
https://github.com/cltk/cltk/blob/1bf19745a35829bcf6a91d3233ef061e4679de67/src/cltk/dependency/stanza.py#L242
CLTK is passing processors = "tokenize,mwt,pos,lemma,depparse" to Stanza.
These are explained here: https://stanfordnlp.github.io/stanza/pipeline.html#processors
We can use this to call Stanza directly without CLTK and use the same tokenization.
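For example, a direct Stanza call with the same processor list might look like this (a sketch; downloading the Latin models with stanza.download("la") beforehand is assumed):

import stanza

nlp = stanza.Pipeline(lang="la", processors="tokenize,mwt,pos,lemma,depparse")
doc = nlp("Arma virumque cano, Troiae qui primus ab oris")
for word in doc.sentences[0].words:
    print(word.text, word.lemma, word.upos, word.feats)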
Process started with bdcd8b9
for example, excīdō and excidō
Creating this so that it becomes a centralized place where we can share our findings on the NLP literature and related subjects. It will come in handy in the future when we implement our own versions, and it may also serve as a good starting point for future contributors wanting to learn how to get into NLP, since the resources and their main ideas will be summarized here.
The pros of creating a single issue for the papers, rather than multiple, are (1) organization, as we won't lose track of other issues, and (2) easier navigation, since we can simply link between entries with Markdown hyperlinks.
We should investigate both Morpheus and https://github.com/biblissima/collatinus
L&S can be downloaded from here: https://github.com/PerseusDL/lexica/blob/c14c9dc7858cc23787d3c0b55152550ff458de35/CTS_XML_TEI/perseus/pdllex/lat/ls/lat.ls.perseus-eng2.xml
We probably don't want to actually check this in for size reasons.
Digital editions of several Latin dictionaries: https://latin-dict.github.io/
We'll also need to make a navigation bar as a result.
What is currently a Pipeline should be a Process.
The existing processing / init part can be modified to be more composable. Currently, the processing part always takes in a TextPart and outputs a ProcessedText, but we should try generic types here instead.
https://docs.python.org/3/library/typing.html#user-defined-generic-types
from typing import Generic, TypeVar

I = TypeVar('I')
O = TypeVar('O')

class Pipeline(Generic[I, O]):
    def process_text(self, text: str, input: I) -> O:
        ...
Then, a Process can take a TextProducer and a series of Pipelines.
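A sketch of what that composition could look like, building on the Pipeline class above (TextProducer and the chaining rule here are assumptions for illustration, not existing code):

from typing import Any, Callable, Sequence

TextProducer = Callable[[], str]  # hypothetical: supplies the raw text

class Process:
    def __init__(self, producer: TextProducer, pipelines: Sequence[Pipeline[Any, Any]]):
        self.producer = producer
        self.pipelines = pipelines

    def run(self, initial: Any) -> Any:
        text = self.producer()
        result = initial
        # Feed each pipeline's output into the next one's input.
        for pipeline in self.pipelines:
            result = pipeline.process_text(text, result)
        return result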
Get a list of available titles from the backend and display them to the user. This can just be a list for now.
Winge 2015 cites the LASLA corpus as the biggest of all annotated Latin corpora, containing over 2,000,000 manually verified tagged lemmata. However, at the time the thesis was written, it wasn't yet fully queryable and available. This might have changed, and it would be of immense help.
note to self: start with the links below
https://lila-erc.eu/query/
https://github.com/CIRCSE/LASLA/tree/main/texts
Tyler Kirby suggested that I look into the existing CRF Tagger in CLTK.
This may be a link to the paper that inspired this: https://aclanthology.org/L16-1240/
Link in the API: https://docs.cltk.org/en/latest/cltk.tag.html#cltk.tag.pos.POSTag.tag_crf
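Judging from the linked API docs, the invocation would look roughly like this (CLTK 0.x-era API; fetching the Latin models via the CLTK corpus importer is assumed):

from cltk.tag.pos import POSTag

tagger = POSTag("latin")
tagged_tokens = tagger.tag_crf("Gallia est omnis divisa in partes tres")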
https://opengreekandlatin.org/ -> other resources
http://www.digitalhumanities.org/dhq/about/about.html -> to keep abreast of the field
Folko's approach below:
GameData.txt
We want it to do the following:
ListTitles
-> Provides an index of header information for available texts.
GetTextSections
-> Provides raw text + analysis for the requested text sections.
GetDefinitions
-> Provides definitions for the given lemmas.
GetTextSections should probably include definitions for the rarest lemmas; this may involve some actual processing. Everything else should come directly from the processed data files.
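A sketch of those three endpoints as plain function signatures (the type names here are assumptions for illustration):

from dataclasses import dataclass, field

@dataclass
class TextSection:
    raw_text: str
    analyses: list = field(default_factory=list)  # per-token analyses for the section

def list_titles() -> list[str]:
    """Provides an index of header information for available texts."""
    ...

def get_text_sections(title: str, sections: list[str]) -> list[TextSection]:
    """Provides raw text + analysis for the requested text sections."""
    ...

def get_definitions(lemmas: list[str]) -> dict[str, str]:
    """Provides definitions for the given lemmas."""
    ...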
Add one for the parsing logic
The annotations would be better organized in a wiki page.
This should mostly reference the README in the macronizer project itself.
Add pytype validation to the presubmit checks.
https://github.com/google/pytype
The CLTK default Latin pipeline seems to perform much better than RFTagger on initial examination. Things to understand:
Should contain input box, submit button, output box.
A tricky case was:
<p> Whatever <add> hi </add> </p>
We need to verify that
<p> Whatever <add> hi </add> hello <add> wtf </add> Hi </p>
would also work.
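One subtlety here: in XML mixed content, the text after a child element ("hello", "Hi") hangs off the child's tail, not the parent's text, so a naive reader that only looks at p.text will drop it. A quick ElementTree sketch of collecting everything in order:

import xml.etree.ElementTree as ET

root = ET.fromstring("<p> Whatever <add> hi </add> hello <add> wtf </add> Hi </p>")

def all_text(node: ET.Element) -> str:
    # Collect node.text plus, for each child, its text and trailing tail.
    parts = [node.text or ""]
    for child in node:
        parts.append(all_text(child))
        parts.append(child.tail or "")
    return "".join(parts)

print(all_text(root))  # " Whatever  hi  hello  wtf  Hi "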