periodo-reconciler's People

Contributors

dependabot-support, ptgolden, rybesh, tfmorris

periodo-reconciler's Issues

Improving reconciliation performance: better recall

  1. Not all values that I would expect to bring up partial matches are doing
    so (e.g. “Jordanian Chalcolithic” in the Perucchetti dataset, or “Possibly
    Archaic” or “Late Republican” in the Met). Also, some direct matches in the
    Europeana spreadsheet didn’t come up (e.g. “vikingatid”). Also, in the
    Europeana dataset, lots of matches came up for “Hellenistic Period” but none
    at all for “Hellenistic Period (323-31 BC)”.

These are probably due to how I've got the string matching configured; right now I think it is biased toward higher precision and lower recall, which may be what we're seeing here. For the ones that you expected to bring up partial matches, can you give me URLs of some of the definitions you expected to see in the results?

  1. It is a little inconvenient not to be able to see the spatial coverage
    values until you call up the details window (only the label values appear).
    Then again, for places with two dozen spatial entities attached, seeing all
    of them will be inconvenient. Not sure which wins here.

Yeah, I thought about that too. We could list only the first N characters followed by "…"; would that help?
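A minimal sketch of that truncation (the function name is hypothetical, not part of the reconciler):

```javascript
// Join a list of spatial coverage labels and truncate the result to at
// most `n` characters, appending an ellipsis when text is cut off.
// (Hypothetical helper for illustration only.)
function truncateCoverage (labels, n) {
  const joined = labels.join(', ')
  return joined.length <= n ? joined : joined.slice(0, n) + '…'
}
```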

  1. Is there a way to match simple to compound names, e.g. “Neolitikum” to
    “tidigneolitikum”, or “Late Classical” and/or “Hellenistic” to “Late
    Classical--Hellenistic”, or “Third Intermediate” to “name=Third
    Intermediate”? I think we’re likely to get false negatives for a large
    number of labels in Germanic languages where prefixes have been added, and
    we’ll also set the bar higher for aggregators like Europeana, who may not be
    enthusiastic about cleaning up thousands of period label fields from their
    contributors. If we can match strings from the middle of words (e.g.
    lithikum), will this make things easier or harder?

We can do that easily. It will increase recall (fewer false negatives) and hurt precision (more false positives). We just need to try it and see where the sweet spot is.
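Matching from the middle of words amounts to a case-insensitive substring test, roughly like this (a sketch; the reconciler's actual matcher is more involved):

```javascript
// True when the query appears anywhere inside a candidate label,
// ignoring case, so "Neolitikum" matches "tidigneolitikum".
// (Illustrative only; names are hypothetical.)
function substringMatch (query, label) {
  return label.toLowerCase().includes(query.toLowerCase())
}
```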

  1. On a similar note, how should we deal with the common case in which the
    period label expresses a range (e.g. “Late Archaic--Early Classical” or
    “Late Republican--Early Imperial” in the Met dataset) or a set of
    alternatives (“Late Republican or Imperial”) that may or may not be a
    continuous range? Do we simply say you can’t reconcile these with PeriodO? I
    don’t know if it’s possible to do a one-to-many relationship with Refine. I
    suppose you could have two reconciliation columns, include in the second
    only range values, and match the first column to the earliest period and the
    second to the latest. But that does still mean that an expression like “Late
    Archaic--Early Classical” or “Late Archaic or Early Classical” or “Late
    Archaic/Early Classical” or “Late Archaic, Early Classical” would have to
    turn up potential matches including both “Late Archaic” and “Early
    Classical”.

This kind of string manipulation is an area where Refine really shines, so I think that if someone has period labels like this that they are trying to reconcile, the best bet is for them to define two (or more) new derived columns with the alternatives, and then reconcile these derived columns rather than the original labels.
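The derived-column split described above comes down to a regular expression over the common range/alternative separators; a sketch in JavaScript (in Refine itself the equivalent would be a GREL `value.split(...)` expression, and the separator list here is an assumption):

```javascript
// Split a compound period label like "Late Archaic--Early Classical"
// or "Late Republican or Imperial" into its component labels.
// Separators covered: "--", "/", comma, and " or ". (Hypothetical list.)
function splitCompoundLabel (label) {
  return label
    .split(/--|\/|,\s*|\s+or\s+/)
    .map(s => s.trim())
    .filter(Boolean)
}
```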

  1. Can we relax the date parameters so that periods that have a start or end
    (or centerpoint, for that matter) within, say, 100 years of the date-range
    in a spreadsheet still appear as possible matches? This would at least show
    the person reconciling that their date-range fell outside those of periods with
    the same name and spatial coverage in PeriodO. There were a couple of
    examples here where e.g. the “Villanovan Geometric” period didn’t come up
    with a match because our Villanovan Geometric didn’t overlap (that one ended
    in -800, ours started in -775). But you’re still probably dealing with the
    same concept in this case. Really different concepts with similar labels
    (Woodland Archaic vs Greek Archaic) are likely to be in different places and
    off by a lot more than a century on either side.

Yes, we can do that easily, although again it will increase recall at the cost of precision. Maybe we could add an option when starting up the server that would allow the user to specify whether they want higher precision, or higher recall, which would then affect how we do the matching (both for labels and temporal extents). It could be high-precision by default, and if they had a lot of unmatched labels they could try high-recall.
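The relaxed temporal check could look roughly like this (a sketch; the function name and the 100-year default are assumptions, not the reconciler's actual implementation):

```javascript
// Two year ranges "match" when the gap between them is at most
// `tolerance` years; with tolerance 0 this is plain overlap.
// Negative years are BCE, so the Villanovan Geometric example
// (one range ending -800, the other starting -775) matches at 100.
function rangesMatch (aStart, aEnd, bStart, bEnd, tolerance = 100) {
  return aStart - tolerance <= bEnd && bStart - tolerance <= aEnd
}
```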

  1. Speaking of which, can we add another reconciliation column that also
    looks at period labels? I’m thinking of the frequent column in museum
    databases entitled “culture”, which overlaps in many cases with period
    labels (e.g. “Archaic Greek”). For the Met, for example, this might help get
    more matches, especially for modifiers like “Lydian” that are more likely to
    be reflected in a label than in spatial coverage (e.g.
    http://n2t.net/ark:/99152/p0cmdf9kmfv).

You can reconcile as many columns as you want right now. The reconciliation server doesn't know anything about columns; it just takes as input the data that Refine sends it when you pick a column to reconcile.

  1. I think some or all of these adjustments will produce a lot more matches.
    But when none of this helps, is there a manual reconciliation process that
    would be somewhere between automated matching and just looking stuff up in
    PeriodO? If your goal is to have a column full of PeriodO URIs, what do you
    do when you have a cell that doesn’t match? Download the csv and fill it in
    by hand?

That's a good question. I'm not sure what the recommended way to do this with Refine is, but I'll look into it.

  1. Can we make a GREL cookbook or sample scripts to derive columns with URLs
    or date-ranges or start/end dates from the matched PeriodO columns? That is,
    you’re starting with a list of periods, and you reconcile them with PeriodO
    periods, and you want to see the PeriodO date ranges too. There used to be a
    way to extract spatial coordinates from Pleiades URIs in Refine, and it was
    super helpful if you weren’t starting with spatial coordinates that were
    already set. This is more for the student/data novice group, but it might
    help build our audience. I can think of teaching exercises for undergrads
    that would work with this.

Good idea; this is the kind of thing that can go into documentation / resources when we publicize the tool.

  1. Areas where we’re deficient/will run into trouble: there are some things
    that I noticed across these datasets that I can fix -- gaps in some Greek
    and Roman periods, and language gaps, especially in German (and a bit in
    Danish -- these are evident in the Europeana dataset, so I will try to
    address the former with the German national library). But there’s a bigger
    issue for spatial matching, which is the coverage labels referring to
    historical spatial entities rather than present ones. Museums in particular
    will tend to use modern nations when they talk about provenance, for reasons
    related to their recordkeeping; so if we try to reconcile with space as a
    factor and have only “Roman Empire” for the spatial coverage of “Late
    Republican”, we’re out of luck. Does this mean that we should try to parse
    these in terms of modern nations now? Or should we wait to see if we can
    link to a URI for a historical polygon in, say, Wikidata, and then establish
    the modern countries involved through Allen operators?

I think we want to move toward using polygons rather than place names to do spatial matching. The code that Bits Coop is writing for us will be usable in the reconciliation server for this purpose.

Sensitivity of string matching

I know this is already on the list, but just in case the final period is a separate issue: the PeriodO label "Athenian supremacy, 479-431 B.C" without the final period in "B.C." (stripped in the LCSH LD interface) won't match against "Athenian supremacy, 479-431 B.C." with the final period.
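One way to make such labels compare equal is to normalize trailing punctuation before matching (a sketch, not what the reconciler currently does):

```javascript
// Normalize a label for comparison: trim whitespace and drop one
// trailing period, so "… 479-431 B.C." equals "… 479-431 B.C".
function normalizeLabel (label) {
  return label.trim().replace(/\.$/, '')
}
```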

"SyntaxError: unexpected token function" on run

I have NodeJS and Refine installed on a PC running Windows 7 SP1. I successfully installed the reconciler from the command line and then performed an update as suggested by npm (now 5.0.3). When I ran the reconciler on a JSON dump from the canonical dataset entitled "p0d.json", however, the following error message appeared:

[screenshot: firstrun_error_062317]

Matching multiple periods to a single value

To continue part of the discussion in #2, I would strongly suggest that we look for a way in which one value in the original dataset could be matched to two or more PeriodO definitions. There will be a lot of cases in which the original definition will be something like "Archaic-Classical", which indicates an uncertain date range between the beginning of the Archaic period and the end of the Classical (or that it's from a tighter date-range that straddles the boundary -- this seems to be the case for a lot of material in the Heidelberg Epigraphic Database, for example). To ask the user to split this into two separate periods, and then go back and split all the items with this attribute into those two periods, seems too burdensome. It also means something different -- an object that is "Archaic-Classical" could be from either period, but an object that is "Archaic" and "Classical" suggests that it's from both.

If it's possible to create a compound period for the purposes of reconciliation, so that "Archaic--Classical" returned both "Archaic" and "Classical" and you could select them both, I think this would be most attractive to data managers.

If that's not possible, however, we should at least return both "Archaic" and "Classical" for "Archaic--Classical" (right now we wouldn't return either, because it would only be a partial match), so that in a pinch a data manager could just reconcile each of two or more duplicate columns with different values, so that you'd have a row with "Archaic--Classical", "PeriodO value for Archaic", "PeriodO value for Classical", and you could explain the relationship in the data model.

Better control of pop-up window with period details

The pop-up screen with the period definition details is fully visible when the value is at the very top or the very bottom of the page, but when the value is near the top or in the upper part of the middle, the pop-up appears partially cut off by the upper bound of the reconciler window. Often you can't scroll far enough up or down to keep the window from being cut off for these values. Is there a way we can make it always appear in the middle of the screen so that it's not cut off?

Broader/narrower matches (continues convo in #2)

The reconciler will return narrower values as possible matches (e.g. "Orientalizing" will return "Early Orientalizing"), but it will not return broader values (e.g. "Early Uruk" will not return "Uruk" as a possible match). This seems to be because the algorithm will only consider possible matches that contain at least all words in the original term, including parenthetical statements, spatial adjectives, etc. Thus "Jordanian Chalcolithic" will not return "Chalcolithic", and "Vikingatid (600-1000 AD)" will not return "Vikingatid". I realize that this is related to efforts to create results sets that are narrow enough to be useful, but it will either require that the user carry out a major amount of data cleaning in advance, or that we accept a lot of false negatives. I would favor broader recall that would return "Uruk" among the possible matches for "Early Uruk" -- or at least testing this out to see how confusing it makes things.
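The rule described here, and the broader-recall variant proposed, can be sketched as token-set containment in both directions (names are hypothetical; the reconciler's real algorithm differs):

```javascript
// Current behavior (as described): every token of the query must
// appear in the candidate. A broader-recall variant also accepts
// candidates whose tokens are all contained in the query, so
// "Early Uruk" can match the broader "Uruk". (Sketch only.)
function tokens (s) {
  return s.toLowerCase().split(/\s+/).filter(Boolean)
}

function broadOrNarrowMatch (query, candidate) {
  const q = tokens(query)
  const c = tokens(candidate)
  const subset = (a, b) => a.every(t => b.includes(t))
  return subset(q, c) || subset(c, q)
}
```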

Substitutions for special characters and numerals

The string matching algorithm at the moment distinguishes between special or accented characters and plain characters, so "Dong Son" for the Vietnamese period will return different results from "Dông Son" (or one will return none). Can we make a basic character substitution map so that special or accented characters match plaintext, or is this too complicated with Unicode?
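Unicode makes this less complicated than it might sound: NFD decomposition followed by stripping combining marks folds most Latin-script accents (a sketch; it does not cover every script or special character):

```javascript
// Fold accented characters to their plain equivalents by decomposing
// to NFD and removing combining diacritical marks (U+0300-U+036F),
// so "Dông Son" compares equal to "Dong Son". (Handles Latin-script
// diacritics; not a complete solution for all of Unicode.)
function foldDiacritics (s) {
  return s.normalize('NFD').replace(/[\u0300-\u036f]/g, '')
}
```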

On a related note, I would suggest that we definitely make the less complicated substitution map between Arabic and Roman numerals, so that "Early Bronze Age 2" returns possible matches in the form "Early Bronze Age II". This will increase recall without creating a lot of false positives, especially if we only substitute between numbers from 1-10 (or at most 1-20 -- no one has phase numbering beyond 20, though you will sometimes see a XII or so).
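The numeral substitution for 1-20 can be a simple lookup table (a sketch; the cutoff at 20 follows the suggestion above):

```javascript
// Map Arabic numerals 1-20 to Roman numerals so "Early Bronze Age 2"
// can also be queried as "Early Bronze Age II". (Illustrative sketch.)
const ROMAN = ['I', 'II', 'III', 'IV', 'V', 'VI', 'VII', 'VIII', 'IX', 'X',
               'XI', 'XII', 'XIII', 'XIV', 'XV', 'XVI', 'XVII', 'XVIII', 'XIX', 'XX']

function arabicToRoman (label) {
  return label.replace(/\b([1-9]|1[0-9]|20)\b/g, n => ROMAN[Number(n) - 1])
}
```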
