nkprasad12 / morcus-net
Source code for Morcus Latin Tools.
Home Page: https://morcus.net
License: GNU General Public License v3.0
(i.e., right now we would run into issues if we wanted to create another module and import it)
The data we want to store / send would be:
We could use protobuf for this: https://www.npmjs.com/package/protobufjs
One possible format might be:
message TokenAnalysis {
  // Lemma form for this analysis. May need a more structured type for principal parts.
  string lemma = 1;
  // POS tags for this lemma. This may also be a proto or a hand-rolled efficient format.
  // Order matters, ranked by preference.
  repeated string tag = 2;
  // Indices of the macrons relative to the start of the token (not the text).
  repeated int32 macron_indices = 3;
}

message TextData {
  // The index of the first character of this token in the original text.
  int32 start_index = 1;
  // The length of this token.
  int32 token_length = 2;
  // Possible analyses for this token. Order matters, ranked by preference.
  repeated TokenAnalysis analyses = 3;
}
We could also try a hand-rolled string version of this format, but we'd need to keep that in sync between TypeScript and Python, since it needs to be written by the NLP pieces and read by the client and potentially the browser.
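As a rough sketch of what writing and reading this could look like on the Python side, assuming the schema above is saved as text_data.proto and compiled with protoc --python_out (the module name text_data_pb2 and the tag string below are illustrative, not existing code):

import text_data_pb2

token = text_data_pb2.TextData(start_index=0, token_length=6)
analysis = token.analyses.add()
analysis.lemma = "excido"
analysis.tag.append("v1spia")  # hypothetical encoding of a POS tag
analysis.macron_indices.append(3)
serialized = token.SerializeToString()  # compact bytes to store or send
restored = text_data_pb2.TextData.FromString(serialized)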
Does CLTK have algorithms to handle this?
Regarding quisque (and friends):
We looked at it a bit today in class. Tom agreed that his sense was that it was rarer than the word quisque, but when you get into the oblique cases and certain forms, there are definitely plenty of attestations. For example, we searched "quorumque" on PHI as one of the most obviously fringe cases and, scrolling through, saw mostly "et quorum"; but we searched "quibusque" and saw a fair mix of each in the brief time that we looked.
It was a bit frustrating to look through, because in these cases it was very easy for a human to quickly see which reading it was, but it seems like it would be hard to train a machine to do the same. I am sure it is still trainable, though, since there are contextual factors a human is picking up on that you could teach a bot too.
Currently we're not doing anything with it:
morcus-net/src/py/utils/pipeline.py, line 109 (at commit c427f03)
We can use the lemma and features of each element of the analyze output, i.e.:
doc = self._nlp.analyze(text_part.text)
doc[0].lemma
doc[0].features
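For reference, a minimal end-to-end CLTK invocation along these lines would be (a sketch based on the quickstart linked below; attribute names may vary across CLTK versions):

from cltk import NLP

nlp = NLP(language="lat")  # downloads models on first use
doc = nlp.analyze(text="Gallia est omnis divisa in partes tres")
print(doc.words[0].lemma, doc.words[0].features)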
Full details: https://docs.cltk.org/en/latest/quickstart.html
Minimal example of usage:
from macronizer import Macronizer

macronizer = Macronizer()
macronizedtext = macronizer.macronize(
    "Iam primum omnium satis constat Troia capta in ceteros saevitum esse Troianos")
print(macronizedtext)
Initializing Macronizer() may take a couple of seconds, so if you want to mark macrons in several strings, you are better off reusing the same Macronizer object. We can invoke this from the command line, but it might take some work to keep the same Macronizer instance alive across all the calls.
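One way to do that (just a sketch, not something the repo does today) is a long-lived worker that holds a single Macronizer and handles one request per line on stdin:

import sys
from macronizer import Macronizer

def main():
    macronizer = Macronizer()  # pay the initialization cost once
    for line in sys.stdin:
        # Write the macronized form of each input line back out.
        print(macronizer.macronize(line.strip()), flush=True)

if __name__ == "__main__":
    main()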
https://github.com/cltk/cltk/blob/1bf19745a35829bcf6a91d3233ef061e4679de67/src/cltk/dependency/stanza.py#L242
CLTK is passing processors = "tokenize,mwt,pos,lemma,depparse" to Stanza.
These are explained here: https://stanfordnlp.github.io/stanza/pipeline.html#processors
We can use this to call Stanza directly without CLTK and use the same tokenization.
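For example, a direct Stanza call with the same processor list might look like this (a sketch; downloading the Latin models with stanza.download("la") beforehand is assumed):

import stanza

nlp = stanza.Pipeline(lang="la", processors="tokenize,mwt,pos,lemma,depparse")
doc = nlp("Arma virumque cano, Troiae qui primus ab oris")
for word in doc.sentences[0].words:
    print(word.text, word.lemma, word.upos, word.feats)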
Process started with bdcd8b9
for example, excīdō and excidō
Creating this so that it becomes a centralized place where we can share our findings on the NLP literature and related subjects. It will come in handy in the future when we implement our own versions, and it may also serve as a good starting point for future contributors wanting to learn how to get into NLP, since the resources and their main ideas will be summarized here.
The pros of creating a single issue for the papers, rather than multiple, are (1) organization, as we won't lose track of other issues, and (2) easier navigation, since we can simply link between entries with Markdown hyperlinks.
We should investigate both Morpheus and https://github.com/biblissima/collatinus
L&S can be downloaded from here: https://github.com/PerseusDL/lexica/blob/c14c9dc7858cc23787d3c0b55152550ff458de35/CTS_XML_TEI/perseus/pdllex/lat/ls/lat.ls.perseus-eng2.xml
We probably don't want to actually check this in for size reasons.
Digital editions of several Latin dictionaries: https://latin-dict.github.io/
We'll also need to make a navigation bar as a result.
What is currently a Pipeline should be a Process.
The existing processing / init part can be modified to be more composable. Currently, the processing part always takes in a TextPart and outputs a ProcessedText, but we should try generic types here instead.
https://docs.python.org/3/library/typing.html#user-defined-generic-types
from typing import Generic, TypeVar

I = TypeVar('I')
O = TypeVar('O')

class Pipeline(Generic[I, O]):
    def process_text(self, text: str, input: I) -> O:
        ...
Then, a Process can take a TextProducer and a series of Pipelines.
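A sketch of what that composition could look like, building on the Pipeline class above (TextProducer and the chaining rule here are assumptions for illustration, not existing code):

from typing import Any, Callable, Sequence

TextProducer = Callable[[], str]  # hypothetical: supplies the raw text

class Process:
    def __init__(self, producer: TextProducer, pipelines: Sequence[Pipeline[Any, Any]]):
        self.producer = producer
        self.pipelines = pipelines

    def run(self, initial: Any) -> Any:
        text = self.producer()
        result = initial
        # Feed each pipeline's output into the next one's input.
        for pipeline in self.pipelines:
            result = pipeline.process_text(text, result)
        return result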
Get a list of available titles from the backend and display them to the user. This can just be a list for now.
Winge 2015 cites the LASLA corpus as the biggest of all annotated Latin corpora, containing over 2,000,000 manually verified tagged lemmata. However, at the time the thesis was written, it wasn't yet fully queryable and available. This might have changed, and it would be of immense help.
note to self: start with the links below
https://lila-erc.eu/query/
https://github.com/CIRCSE/LASLA/tree/main/texts
Tyler Kirby suggested that I look into the existing CRF Tagger in CLTK.
This may be a link to the paper that inspired this: https://aclanthology.org/L16-1240/
Link in the API: https://docs.cltk.org/en/latest/cltk.tag.html#cltk.tag.pos.POSTag.tag_crf
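Judging from the linked API docs, the invocation would look roughly like this (CLTK 0.x-era API; fetching the Latin models via the CLTK corpus importer is assumed):

from cltk.tag.pos import POSTag

tagger = POSTag("latin")
tagged_tokens = tagger.tag_crf("Gallia est omnis divisa in partes tres")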
https://opengreekandlatin.org/ -> other resources
http://www.digitalhumanities.org/dhq/about/about.html -> to keep abreast of the field
Folko's approach below:
GameData.txt
We want it to do the following:
ListTitles
-> Provides an index of header information for available texts.
GetTextSections
-> Provides raw text + analysis for the requested text sections.
GetDefinitions
-> Provides definitions for the given lemmas.
GetTextSections should probably include definitions for the rarest lemmas; this may involve some actual processing. Everything else should come directly from the processed data files.
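A sketch of those three endpoints as plain function signatures (the type names here are assumptions for illustration):

from dataclasses import dataclass, field

@dataclass
class TextSection:
    raw_text: str
    analyses: list = field(default_factory=list)  # per-token analyses for the section

def list_titles() -> list[str]:
    """Provides an index of header information for available texts."""
    ...

def get_text_sections(title: str, sections: list[str]) -> list[TextSection]:
    """Provides raw text + analysis for the requested text sections."""
    ...

def get_definitions(lemmas: list[str]) -> dict[str, str]:
    """Provides definitions for the given lemmas."""
    ...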
Add one for the parsing logic
The annotations would be better organized in a wiki page.
This should mostly reference the README in the macronizer project itself.
Add pytype validation to the presubmit checks.
https://github.com/google/pytype
The CLTK default Latin pipeline seems to perform much better than RFTagger on initial examination. Things to understand:
Should contain input box, submit button, output box.
A tricky case was:
<p> Whatever <add> hi </add> </p>
We need to verify that
<p> Whatever <add> hi </add> hello <add> wtf </add> Hi </p>
would also work.
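One subtlety here: in XML mixed content, the text after a child element ("hello", "Hi") hangs off the child's tail, not the parent's text, so a naive reader that only looks at p.text will drop it. A quick ElementTree sketch of collecting everything in order:

import xml.etree.ElementTree as ET

root = ET.fromstring("<p> Whatever <add> hi </add> hello <add> wtf </add> Hi </p>")

def all_text(node: ET.Element) -> str:
    # Collect node.text plus, for each child, its text and trailing tail.
    parts = [node.text or ""]
    for child in node:
        parts.append(all_text(child))
        parts.append(child.tail or "")
    return "".join(parts)

print(all_text(root))  # " Whatever  hi  hello  wtf  Hi "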