using the categorization system established by tharsen & wang, we can see how spaC

train a span categorizer on jeff & hantao's data [closed]

direct-phonology commented on June 8, 2024

train a span categorizer on jeff & hantao's data

from jdsw.

Comments (2)

thatbudakguy commented on June 8, 2024

- write a script that converts t&w's .tsv output to Prodigy's JSON Lines format
- use the db-in recipe to load all of the data into a database
- use prodigy train to auto-infer the suggester function for the spancat and test out training one
- try out train-curve to see how the model responds to more or less data
- add functions to the Streamlit app to run the model on arbitrary input and display predictions with the span visualizer
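
The first step above might look something like the sketch below. The layout of t&w's .tsv isn't described in this thread, so the three-column format here (document id, text segment, tag) is purely an assumption for illustration, as are the function names `tsv_to_prodigy` and `write_jsonl`. The output uses the `text` plus `spans` (with `start`, `end`, `label`) shape that Prodigy tasks use for span annotations.

```python
import csv
import io
import json
from itertools import groupby


def tsv_to_prodigy(tsv_file):
    """Yield one task per document from (doc_id, segment, tag) rows.

    Assumed (hypothetical) input: rows are grouped by doc_id and the
    segments of a document concatenate, in order, into its full text.
    """
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for doc_id, rows in groupby(reader, key=lambda row: row[0]):
        text, spans = "", []
        for _, segment, tag in rows:
            start = len(text)
            text += segment
            # Character offsets into the concatenated text.
            spans.append({"start": start, "end": len(text), "label": tag})
        yield {"text": text, "spans": spans, "meta": {"doc_id": doc_id}}


def write_jsonl(tasks, out_file):
    """Write one JSON object per line, keeping Han characters readable."""
    for task in tasks:
        out_file.write(json.dumps(task, ensure_ascii=False) + "\n")


# Tiny made-up sample: a headword followed by a fanqie gloss.
sample = "1\t東\tE\n1\t德紅反\tF\n"
tasks = list(tsv_to_prodigy(io.StringIO(sample)))

buffer = io.StringIO()
write_jsonl(tasks, buffer)
```

A file written this way could then be loaded with `prodigy db-in`, and the resulting dataset name passed to `prodigy train` and `prodigy train-curve` via `--spancat`; the exact arguments depend on the Prodigy version, so check its docs.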

thatbudakguy commented on June 8, 2024

Jeff & Hantao's span categories include very granular info on some kinds of content, but skip over other kinds we're interested in:

TAG_MAP = {
  "E": "E",     # headword
  "B": "T",     # book title
  "BC": "C",    # commentary on book title
  "F": "F",     # fanqie
  "T": "T",     # poem title
  "J": "T",     # juan number
  "C": "C",     # commentary on headword
  "CF": "F",    # fanqie reading for char in commentary
  "CC": "C",    # commentary on commentary
  "S": "T",     # section title
  "SC": "C",    # commentary on section title
  "SF": "F",    # fanqie reading for char in section title
  "SS": "T",    # sub-section title
  "SSC": "C",   # commentary on sub-section title
  "SSF": "F",   # fanqie reading for char in sub-section title
}
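
As a rough illustration of how a mapping like this could be applied, the sketch below collapses the granular labels on a Prodigy-style task into the coarser ones. The helper name `collapse_labels` and the sample task are hypothetical, and only a subset of the map is repeated so the example runs on its own.

```python
# Subset of the TAG_MAP above, repeated so the example is self-contained.
TAG_MAP = {"E": "E", "F": "F", "CF": "F", "SC": "C", "SS": "T"}


def collapse_labels(task, tag_map):
    """Return a copy of a task whose span labels are mapped to coarse tags."""
    spans = [{**span, "label": tag_map[span["label"]]} for span in task["spans"]]
    return {**task, "spans": spans}


# Made-up task with one headword span and one granular fanqie span.
task = {
    "text": "東德紅反",
    "spans": [
        {"start": 0, "end": 1, "label": "E"},
        {"start": 1, "end": 4, "label": "CF"},
    ],
}
coarse = collapse_labels(task, TAG_MAP)
```

The original task is left unmodified, so the granular labels stay available if they turn out to be useful later.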

I determined that training a model on this data doesn't really fit our research question. We can already identify fanqie without a model, and identifying fanqie is mostly what this data would teach a model to do, so there doesn't seem to be much point in pursuing it (except perhaps later, to help detect which of the characters in the headword is being annotated).
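
For context, a rule-based fanqie detector can be quite small. The sketch below is not necessarily how this project does it; it assumes a fanqie gloss is written as two Han characters followed by 反 or 切, and the names `FANQIE` and `find_fanqie` are made up for the example.

```python
import re

# Assumption: a fanqie gloss is "initial speller + final speller + marker",
# where the marker is 反 or 切. The character class covers only the basic
# CJK Unified Ideographs block, so rarer extension characters are missed.
HAN = r"[\u4e00-\u9fff]"
FANQIE = re.compile(f"({HAN})({HAN})[反切]")


def find_fanqie(text):
    """Return (start, end, matched_text) for each fanqie-like sequence."""
    return [(m.start(), m.end(), m.group(0)) for m in FANQIE.finditer(text)]


matches = find_fanqie("東德紅反")
```

A pattern this loose will also match ordinary prose in which 反 or 切 happens to follow two other characters, so in practice it would need context-specific filtering.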
