periodo / periodo-reconciler
OpenRefine reconciliation service for PeriodO data
- Not all values that I would expect to bring up partial matches are doing
so (e.g. “Jordanian Chalcolithic” in the Perucchetti dataset, or “Possibly
Archaic” or “Late Republican” in the Met). Also, some direct matches in the
Europeana spreadsheet didn’t come up (e.g. “vikingatid”). And in the
Europeana dataset, lots of matches came up for “Hellenistic Period” but none
at all for “Hellenistic Period (323-31 BC)”.
These are probably due to how I've got the string matching configured; right now I think it is biased toward higher precision and lower recall, which is maybe what we're seeing here. For the ones that you expected to bring up partial matches, can you give me URLs of some of the definitions you expected to see in the results?
- It is a little inconvenient not to be able to see the spatial coverage
values until you call up the details window (only the label values appear).
Then again, for places with two dozen spatial entities attached, seeing all
of them will be inconvenient. Not sure which wins here.
Yeah, I thought about that too. We could list only the first N characters followed by "…"; would that help?
- Is there a way to match simple to compound names, e.g. “Neolitikum” to
“tidigneolitikum”, or “Late Classical” and/or “Hellenistic” to “Late
Classical--Hellenistic”, or “Third Intermediate” to “name=Third
Intermediate”? I think we’re likely to get false negatives for a large
number of labels in Germanic languages where prefixes have been added, and
we’ll also set the bar higher for aggregators like Europeana, who may not be
enthusiastic about cleaning up thousands of period label fields from their
contributors. If we can match strings from the middle of words (e.g.
lithikum), will this make things easier or harder?
We can do that easily. It will increase recall (fewer false negatives) and hurt precision (more false positives). We just need to try it and see where the sweet spot is.
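As a rough illustration (the function and variable names here are hypothetical, not the reconciler's actual code), infix matching is just a case-insensitive substring test:

```javascript
// Hypothetical sketch of infix label matching: the query can match anywhere
// inside a candidate label, so "litikum" finds the Germanic compounds.
function infixMatches(query, labels) {
  const q = query.toLowerCase();
  return labels.filter((label) => label.toLowerCase().includes(q));
}

const labels = ["Neolitikum", "tidigneolitikum", "Mesolitikum", "Late Classical"];
infixMatches("litikum", labels);
// → ["Neolitikum", "tidigneolitikum", "Mesolitikum"]
```

The recall/precision tradeoff is visible even in this toy: the infix test also admits any label that happens to contain the query as a fragment.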
- On a similar note, how should we deal with the common case in which the
period label expresses a range (e.g. “Late Archaic--Early Classical” or
“Late Republican--Early Imperial” in the Met dataset) or a set of
alternatives (“Late Republican or Imperial”) that may or may not be a
continuous range? Do we simply say you can’t reconcile these with PeriodO? I
don’t know if it’s possible to do a one-to-many relationship with Refine. I
suppose you could have two reconciliation columns, include in the second
only range values, and match the first column to the earliest period and the
second to the latest. But that does still mean that an expression like “Late
Archaic--Early Classical” or “Late Archaic or Early Classical” or “Late
Archaic/Early Classical” or “Late Archaic, Early Classical” would have to
turn up potential matches including both “Late Archaic” and “Early
Classical”.
This kind of string manipulation is an area where Refine really shines, so I think that if someone has period labels like this that they are trying to reconcile, the best bet is for them to define two (or more) new derived columns with the alternatives, and then reconcile these derived columns rather than the original labels.
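As a sketch of what those derived columns could look like in GREL (the separators I normalize here, and the assumption of exactly two parts, are mine), the first derived column might be:

```grel
value.replace(" or ", "--").replace("/", "--").split("--")[0].trim()
```

and the same expression with `[1]` instead of `[0]` for the second column.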
- Can we relax the date parameters so that periods that have a start or end
(or centerpoint, for that matter) within, say, 100 years of the date-range
in a spreadsheet still appear as possible matches? This would at least show
the reconciler that his or her date-range fell outside those of periods with
the same name and spatial coverage in PeriodO. There were a couple of
examples here where e.g. the “Villanovan Geometric” period didn’t come up
with a match because our Villanovan Geometric didn’t overlap (that one ended
in -800, ours started in -775). But you’re still probably dealing with the
same concept in this case. Really different concepts with similar labels
(Woodland Archaic vs Greek Archaic) are likely to be in different places and
off by a lot more than a century on either side.
Yes, we can do that easily, although again it will increase recall at the cost of precision. Maybe we could add an option when starting up the server that would allow the user to specify whether they want higher precision, or higher recall, which would then affect how we do the matching (both for labels and temporal extents). It could be high-precision by default, and if they had a lot of unmatched labels they could try high-recall.
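A minimal sketch of what that startup option could look like (the `--mode` flag and the threshold values are assumptions, not an existing interface):

```javascript
// Hypothetical: a --mode flag chosen at server startup that tunes a single
// similarity threshold used for both label and temporal matching.
const THRESHOLDS = { precision: 0.85, recall: 0.6 };

function matchThreshold(argv) {
  // Default to high precision; users with many unmatched labels can retry
  // with --mode=recall to loosen the matching.
  const arg = argv.find((a) => a.startsWith("--mode="));
  const mode = arg ? arg.slice("--mode=".length) : "precision";
  return THRESHOLDS[mode] ?? THRESHOLDS.precision;
}
```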
- Speaking of which, can we add another reconciliation column that also
looks at period labels? I’m thinking of the frequent column in museum
databases entitled “culture”, which overlaps in many cases with period
labels (e.g. “Archaic Greek”). For the Met, for example, this might help get
more matches, especially for modifiers like “Lydian” that are more likely to
be reflected in a label than in spatial coverage (e.g.
http://n2t.net/ark:/99152/p0cmdf9kmfv).
You can reconcile as many columns as you want right now. The reconciliation server doesn't know anything about columns; it just takes as input the data that Refine sends it when you pick a column to reconcile.
- I think some or all of these adjustments will produce a lot more matches.
But when none of this helps, is there a manual reconciliation process that
would be somewhere between automated matching and just looking stuff up in
PeriodO? If your goal is to have a column full of PeriodO URIs, what do you
do when you have a cell that doesn’t match? Download the csv and fill it in
by hand?
That's a good question. I'm not sure what the recommended way to do this with Refine is, but I'll look into it.
- Can we make a GREL cookbook or sample scripts to derive columns with URLs
or date-ranges or start/end dates from the matched PeriodO columns? That is,
you’re starting with a list of periods, and you reconcile them with PeriodO
periods, and you want to see the PeriodO date ranges too. There used to be a
way to extract spatial coordinates from Pleiades URIs in Refine, and it was
super helpful if you weren’t starting with spatial coordinates that were
already set. This is more for the student/data novice group, but it might
help build our audience. I can think of teaching exercises for undergrads
that would work with this.
Good idea; this is the kind of thing that can go into documentation / resources when we publicize the tool.
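For instance, once a column is reconciled, a derived URI column is one standard GREL recon expression away (this is ordinary GREL, though presenting it as part of a cookbook is just the suggestion above):

```grel
cell.recon.match.id
```

`cell.recon.match.name` gives the matched label; pulling start/end dates would need something beyond GREL's recon variables, such as the Data Extension API discussed below in this thread.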
- Areas where we’re deficient/will run into trouble: there are some things
that I noticed across these datasets that I can fix -- gaps in some Greek
and Roman periods, and language gaps, especially in German (and a bit in
Danish -- these are evident in the Europeana dataset, so I will try to
address the former with the German national library). But there’s a bigger
issue for spatial matching, which is the coverage labels referring to
historical spatial entities rather than present ones. Museums in particular
will tend to use modern nations when they talk about provenance, for reasons
related to their recordkeeping; so if we try to reconcile with space as a
factor and have only “Roman Empire” for the spatial coverage of “Late
Republican”, we’re out of luck. Does this mean that we should try to parse
these in terms of modern nations now? Or should we wait to see if we can
link to a URI for a historical polygon in, say, Wikidata, and then establish
the modern countries involved through Allen operators?
I think we want to move toward using polygons rather than place names to do spatial matching. The code that Bits Coop is writing for us will be usable in the reconciliation server for this purpose.
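As a toy illustration of why polygons help (a real implementation would use proper polygon intersection; this bounding-box test is just a stand-in for the sketch):

```javascript
// Sketch: two spatial extents "overlap" if their bounding boxes intersect.
// Boxes are [minLon, minLat, maxLon, maxLat]; true polygon intersection is
// more precise but has the same shape of comparison.
function bboxesIntersect(a, b) {
  return a[0] <= b[2] && b[0] <= a[2] && a[1] <= b[3] && b[1] <= a[3];
}
```

With geometry rather than labels, "Roman Empire" and a modern nation like Italy would intersect spatially even though their names share nothing.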
Hi @rybesh,
This reconciliation endpoint looks great! I wanted to add it to our list at https://reconciliation-api.github.io/testbench/ but could not find if it is hosted anywhere? Do you have plans to deploy it?
Keep up the good work!
I know this is already on the list, but just in case the period is a different issue, the PeriodO label "Athenian supremacy, 479-431 B.C" without the final period in "B.C." (stripped in the LCSH LD interface) won't match against "Athenian supremacy, 479-431 B.C." with the final period.
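A simple normalization pass would cover this class of mismatch (a sketch only; the function name is mine):

```javascript
// Strip trailing periods (and surrounding whitespace) before comparison, so
// "…479-431 B.C" and "…479-431 B.C." normalize to the same string.
function normalizeLabel(s) {
  return s.trim().replace(/\.+$/, "");
}

normalizeLabel("Athenian supremacy, 479-431 B.C.");
// → "Athenian supremacy, 479-431 B.C"
```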
I have NodeJS and Refine installed on a PC running Windows 7 SP1. I successfully installed the reconciler from the command line and then performed an update as suggested by npm (now 5.0.3). When I ran the reconciler on a json dump from the canonical dataset entitled "p0d.json", however, the following error message appeared:
Running the reconciler against the LCSH subject headings does not match "Bourbons, 1700-", despite http://n2t.net/ark:/99152/p06c6g3qwzd.
To continue part of the discussion in #2, I would strongly suggest that we look for a way in which one value in the original dataset could be matched to two or more PeriodO definitions. There will be a lot of cases in which the original definition will be something like "Archaic-Classical", which indicates an uncertain date range between the beginning of the Archaic period and the end of the Classical (or that it's from a tighter date-range that straddles the boundary -- this seems to be the case for a lot of material in the Heidelberg Epigraphic Database, for example). To ask the user to split this into two separate periods, and then go back and split all the items with this attribute into those two periods, seems too burdensome. It also means something different -- an object that is "Archaic-Classical" could be from either period, but an object that is "Archaic" and "Classical" suggests that it's from both.
If it's possible to create a compound period for the purposes of reconciliation, so that "Archaic--Classical" returned both "Archaic" and "Classical" and you could select them both, I think this would be most attractive to data managers.
If that's not possible, however, we should at least return both "Archaic" and "Classical" for "Archaic--Classical" (right now we wouldn't return either, because it would only be a partial match), so that in a pinch a data manager could just reconcile each of two or more duplicate columns with different values, so that you'd have a row with "Archaic--Classical", "PeriodO value for Archaic", "PeriodO value for Classical", and you could explain the relationship in the data model.
The pop-up with the period definition details displays fully when the value is at the very top or very bottom of the page, but when the value is near the top or in the upper-middle of the page, the pop-up is partially cut off by the upper bound of the reconciler window. Often you can't scroll far enough up or down to keep the window from being cut off for these values. Is there a way we can make it always appear in the middle of the screen so that it's not cut off?
The reconciler will return narrower values as possible matches (e.g. "Orientalizing" will return "Early Orientalizing"), but it will not return broader values (e.g. "Early Uruk" will not return "Uruk" as a possible match). This seems to be because the algorithm will only consider possible matches that contain at least all words in the original term, including parenthetical statements, spatial adjectives, etc. Thus "Jordanian Chalcolithic" will not return "Chalcolithic", and "Vikingatid (600-1000 AD)" will not return "Vikingatid". I realize that this is related to efforts to create results sets that are narrow enough to be useful, but it will either require that the user carry out a major amount of data cleaning in advance, or that we accept a lot of false negatives. I would favor broader recall that would return "Uruk" among the possible matches for "Early Uruk" -- or at least testing this out to see how confusing it makes things.
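One way to get those broader matches back (a sketch under my own assumptions, not the current algorithm) is to score candidates by token overlap instead of requiring a candidate to contain every query token:

```javascript
// Sketch: score a candidate by the fraction of shared tokens, so "Uruk"
// still surfaces for "Early Uruk" (score 0.5) instead of being dropped.
function tokenOverlap(query, candidate) {
  const qs = new Set(query.toLowerCase().split(/\s+/));
  const cs = new Set(candidate.toLowerCase().split(/\s+/));
  let shared = 0;
  for (const t of qs) if (cs.has(t)) shared++;
  return shared / Math.max(qs.size, cs.size);
}

tokenOverlap("Early Uruk", "Uruk"); // → 0.5
```

Candidates could then be ranked by score with a configurable cutoff, which is where the precision/recall tuning discussed above would apply.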
This would enable reconcilers using OpenRefine to populate their table with data from PeriodO properties, e.g. pull the start and end years.
https://github.com/OpenRefine/OpenRefine/wiki/Data-Extension-API
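A data-extension request might look roughly like this (the property ids `start` and `stop` are hypothetical; the actual ids would be whatever the PeriodO service defines):

```json
{
  "ids": ["http://n2t.net/ark:/99152/p0cmdf9kmfv"],
  "properties": [{ "id": "start" }, { "id": "stop" }]
}
```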
The string matching algorithm at the moment distinguishes between special or accented characters and plain characters, so "Dong Son" for the Vietnamese period will return different results from "Dông Son" (or one will return none). Can we make a basic character substitution map so that special or accented characters match plaintext, or is this too complicated with Unicode?
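Unicode actually makes this fairly straightforward: decompose with NFD and strip the combining marks, rather than maintaining a hand-built substitution map (a sketch; the function name is mine):

```javascript
// Fold accented characters to plain ASCII: NFD decomposition splits "ô" into
// "o" plus a combining circumflex, and the regex removes the combining marks.
function foldDiacritics(s) {
  return s.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

foldDiacritics("Dông Son"); // → "Dong Son"
```

Folding both the query and the candidate labels before comparison makes "Dong Son" and "Dông Son" return the same results.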
On a related note, I would suggest that we definitely make the less complicated substitution map between Arabic and Roman numerals, so that "Early Bronze Age 2" returns possible matches in the form "Early Bronze Age II". This will increase recall without creating a lot of false positives, especially if we only substitute between numbers from 1-10 (or at most 1-20 -- no one has phase numbering beyond 20, though you will sometimes see a XII or so).
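A sketch of that substitution for 1-20 (one direction shown; the reverse map is symmetric):

```javascript
// Replace standalone Arabic numerals 1-20 in a label with Roman numerals,
// so "Early Bronze Age 2" also queries as "Early Bronze Age II".
const ROMAN = ["I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX", "X",
               "XI", "XII", "XIII", "XIV", "XV", "XVI", "XVII", "XVIII", "XIX", "XX"];

function arabicToRoman(label) {
  // \b guards keep multi-digit years like "1000" untouched.
  return label.replace(/\b([1-9]|1[0-9]|20)\b/g, (n) => ROMAN[Number(n) - 1]);
}

arabicToRoman("Early Bronze Age 2"); // → "Early Bronze Age II"
```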