morcus-net's People

Contributors

kyrias, nkprasad12, reidanderson

morcus-net's Issues

Define efficient storage / wire format for processed texts

The data we want to store / send would be:

  1. Token
  2. List of (Lemma, POS tag, long vowels)

We could use protobuf for this: https://www.npmjs.com/package/protobufjs
One possible format might be:

syntax = "proto3";

message TokenAnalysis {
  // Lemma form for this analysis. May need a more structured type for principal parts.
  string lemma = 1;

  // POS tags for this lemma. This may also be a proto or a hand rolled efficient format.
  // Order matters, ranked by preference.
  repeated string tag = 2;

  // Indices of the macrons relative to the start of the token (not the text).
  repeated int32 macron_indices = 3;
}

message TextData {
  // The index of the first character of this token in the original text.
  int32 start_index = 1;

  // The length of this token.
  int32 token_length = 2;

  // Possible analyses for this token. Order matters, ranked by preference.
  repeated TokenAnalysis analyses = 3;
}

We could also try a hand-rolled string version of this format, but we'd need to keep it in sync between TypeScript and Python, since it needs to be written by the NLP pieces and read by the client and potentially the browser.
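
For comparison, here is a minimal sketch of what the Python side of a hand-rolled string format might look like. It is only an illustration: the dataclasses mirror the proto messages above, while the positional JSON-array layout and the `encode` / `decode` names are assumptions made for this example, not a decided format. A TypeScript reader would have to agree on the same field order.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class TokenAnalysis:
    lemma: str
    tags: List[str] = field(default_factory=list)
    macron_indices: List[int] = field(default_factory=list)


@dataclass
class TextData:
    start_index: int
    token_length: int
    analyses: List[TokenAnalysis] = field(default_factory=list)


def encode(token: TextData) -> str:
    """Serializes one token as a compact JSON array (positional, no field names)."""
    payload = [
        token.start_index,
        token.token_length,
        [[a.lemma, a.tags, a.macron_indices] for a in token.analyses],
    ]
    # ensure_ascii=False keeps macronized vowels readable; the tight
    # separators drop the whitespace json.dumps would otherwise add.
    return json.dumps(payload, ensure_ascii=False, separators=(",", ":"))


def decode(raw: str) -> TextData:
    """Inverse of encode; relies on the same positional field order."""
    start_index, token_length, analyses = json.loads(raw)
    return TextData(
        start_index,
        token_length,
        [TokenAnalysis(lemma, tags, macrons) for lemma, tags, macrons in analyses],
    )
```

Dropping field names keeps the payload small, but it also means any change in field order has to be made in both languages at once, which is the sync cost mentioned above.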

Investigate tokenization approaches

Does CLTK have algorithms to handle this?

Regarding quisque (and friends):
We looked at it a bit today in class. Tom agreed that his sense was that it was rarer than having the word quisque, but when you get into the oblique cases, and certain forms, there are definitely plenty of attestations. Like we searched "quorumque" as one of the most obviously fringe cases on PHI and scrolling through saw mostly "et quorum", but searched "quibusque" and saw a fair mix of each, in the brief time that we looked.

It was a bit frustrating to look through, because these felt like cases where a human can quickly see which reading is meant, but a machine would be hard to train on them. I am sure it is still trainable, though, since a human is picking up on contextual factors that you could teach a bot too.

Write code that runs the local macronizer on a `FullText`

Minimal example of usage:
    from macronizer import Macronizer
    macronizer = Macronizer()
    macronizedtext = macronizer.macronize("Iam primum omnium satis constat Troia capta in ceteros saevitum esse Troianos")

Initializing Macronizer() may take a couple of seconds, so if you want
to mark macrons in several strings, you are better off reusing the
same Macronizer object.

We can invoke this from the command line, but it might take some work to keep the same Macronizer instance alive across all the calls.
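
One way to keep a single instance alive is to do all the work inside one Python process rather than one command-line call per text. The sketch below is an assumption about how that could look: `macronize_all` and `SupportsMacronize` are hypothetical names, and the only thing taken from the macronizer is the `macronize(text)` method shown in the usage example above.

```python
from typing import Callable, Iterable, List, Protocol


class SupportsMacronize(Protocol):
    """Anything with a macronize(text) -> str method, e.g. Macronizer."""

    def macronize(self, text: str) -> str: ...


def macronize_all(
    texts: Iterable[str], factory: Callable[[], SupportsMacronize]
) -> List[str]:
    """Builds the macronizer once, then applies it to every text."""
    macronizer = factory()  # the slow initialization happens a single time
    return [macronizer.macronize(text) for text in texts]
```

With the real library this would presumably be called as `macronize_all(texts, Macronizer)`, where `texts` could come from the sections of a `FullText` or from lines read on standard input.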

Notes on NLP papers and research

Creating this so that there is a central place where we can share our findings from the literature on NLP and related subjects. It will come in handy when we implement our own versions. It may also serve as a good starting point for future contributors who want to learn NLP, since the resources and their main ideas will be summarized here.

The pros of creating a single issue for the papers rather than several are (1) organization, as we won't lose track of other issues, and (2) easier navigation, as we can simply link between entries with Markdown hyperlinks.

Refactor processing pipeline

What is currently a Pipeline should be a Process.

The existing processing / init part can be modified to be more composable. Currently, the processing part always takes in a TextPart and outputs a ProcessedText but we should try generic types here instead.

https://docs.python.org/3/library/typing.html#user-defined-generic-types

from typing import Generic, TypeVar

I = TypeVar('I')
O = TypeVar('O')

class Pipeline(Generic[I, O]):

  def process_text(self, text: str, input: I) -> O:
    ...

Then, a Process can take a TextProducer and a series of Pipelines.
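
A rough sketch of how that composition might work is below. The details are assumptions for illustration only: the `TextProducer` is modeled as a plain iterable of strings, and `Tokenize` and `CountTokens` are hypothetical toy stages that stand in for real NLP steps.

```python
from typing import Any, Generic, Iterable, Iterator, List, Tuple, TypeVar

I = TypeVar("I")
O = TypeVar("O")


class Pipeline(Generic[I, O]):
    def process_text(self, text: str, input: I) -> O:
        raise NotImplementedError


class Tokenize(Pipeline[None, List[str]]):
    """Toy first stage: ignores its input and splits on whitespace."""

    def process_text(self, text: str, input: None) -> List[str]:
        return text.split()


class CountTokens(Pipeline[List[str], int]):
    """Toy second stage: consumes the token list from the previous stage."""

    def process_text(self, text: str, input: List[str]) -> int:
        return len(input)


class Process:
    """Feeds each produced text through the stages, chaining outputs to inputs."""

    def __init__(self, producer: Iterable[str], stages: List[Pipeline[Any, Any]]):
        self.producer = producer
        self.stages = stages

    def run(self) -> Iterator[Tuple[str, Any]]:
        for text in self.producer:
            value: Any = None  # the first stage receives no prior output
            for stage in self.stages:
                value = stage.process_text(text, value)
            yield text, value
```

The stage list is typed loosely here, so a mismatch between one stage's output and the next stage's input is only caught at runtime. A stricter design could compose stages pairwise so the type checker can verify the chain.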

Create titles browser

Get a list of available titles from the backend and display them to the user. This can just be a list for now.

Create initial backend server

We want it to do the following:

ListTitles -> Provide an index of header information for available texts.
GetTextSections -> Provides raw text + analysis for the requested text sections.
GetDefinitions -> Provides definitions for the given lemmas

GetTextSections should probably include definitions for the rarest lemmas; this may involve some actual processing. Everything else should come directly from the processed data files.
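
To make the shape of these three calls concrete, here is a small in-memory sketch. It is not a server and not the planned implementation: the class, the method signatures, and the dictionaries standing in for the processed data files are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TitleInfo:
    text_id: str
    title: str
    author: str


@dataclass
class Section:
    raw_text: str
    analysis: str  # opaque serialized analysis, in whatever wire format is chosen


class InMemoryBackend:
    """Serves pre-processed data held in dictionaries (stand-ins for data files)."""

    def __init__(
        self,
        titles: List[TitleInfo],
        sections: Dict[str, Dict[str, Section]],
        definitions: Dict[str, str],
    ):
        self._titles = titles
        self._sections = sections  # text_id -> section_id -> Section
        self._definitions = definitions  # lemma -> definition

    def list_titles(self) -> List[TitleInfo]:
        """ListTitles: header information for every available text."""
        return list(self._titles)

    def get_text_sections(self, text_id: str, section_ids: List[str]) -> List[Section]:
        """GetTextSections: raw text plus analysis for the requested sections."""
        by_id = self._sections[text_id]
        return [by_id[section_id] for section_id in section_ids]

    def get_definitions(self, lemmas: List[str]) -> Dict[str, str]:
        """GetDefinitions: definitions for the lemmas that are known."""
        return {
            lemma: self._definitions[lemma]
            for lemma in lemmas
            if lemma in self._definitions
        }
```

Attaching definitions for the rarest lemmas to `get_text_sections` would need lemma frequency data, which this sketch leaves out.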

Investigate CLTK Latin Pipeline

The CLTK default Latin pipeline seems to perform much better than RFTagger on initial examination. Things to understand:

  1. Which taggers does it use? Is it the CRF Tagger (#32)?
  2. What's the best way to improve its latency? Initial estimates suggest ~1 hour (slightly more) for all of DBG.
  3. I tested it on some samples of DBG and it did well. We need to do a systematic evaluation to see how it does vs Winge.
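
For the systematic evaluation in item 3, one simple option is token-level accuracy against a hand-checked sample. The sketch below assumes such a gold sample exists and that each system's output has already been aligned to it token by token. The function names are hypothetical and nothing here calls CLTK or the macronizer directly.

```python
from typing import Dict, Sequence


def tag_accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must align token by token")
    if not gold:
        return 0.0
    matches = sum(g == p for g, p in zip(gold, predicted))
    return matches / len(gold)


def compare_taggers(
    gold: Sequence[str], outputs: Dict[str, Sequence[str]]
) -> Dict[str, float]:
    """Scores every named system against the same gold sample."""
    return {name: tag_accuracy(gold, tags) for name, tags in outputs.items()}
```

The same comparison could be run on lemmas or macron positions by passing those sequences in place of POS tags.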
