
dphon

[badges: CI, codecov, Python versions, Zenodo, spaCy, code style: black]

installation

this software is tested on the latest versions of macOS, Windows, and Ubuntu. you will need a supported version of Python (see the badges above), along with pip.

$ pip install dphon

if you're on Windows and are seeing incorrectly formatted output in your terminal, have a look at this Stack Overflow answer.

usage

basics

the main function of dphon is to look for instances of text reuse in a corpus of Old Chinese texts. instead of relying purely on graphemes, it does this by performing grapheme-to-phoneme conversion and determining possible reuse based on whether passages are likely to have sounded similar (or rhymed) when spoken aloud.

you will need to have files stored locally in UTF-8 encoded plain-text (.txt) or JSON Lines (.jsonl) format. for the former, one file is assumed to represent one document. for the latter, one file can contain any number of lines, each of which is a document, with required keys id (a unique identifier) and text (text content) and any number of optional keys. you can obtain a representative corpus of Old Chinese sourced from the Kanseki Repository via direct-phonology/ect-krp.
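for example, a minimal .jsonl corpus might look like the following (the identifiers, titles, and text are purely illustrative):

{"id": "text_a", "text": "夏后啟曰以為可為故為之為之天下弗能", "title": "example document"}
{"id": "text_b", "text": "不可弗爲以爲可故爲之爲之繇其道物"}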

a simple invocation of dphon might look like:

$ dphon text_a.txt text_b.txt

which would look for phonetically similar passages between text_a and text_b. the output will be a list of sequences and their phonemic transcriptions, with an identifier based on the file's name and an indicator of where in the text the sequence occurs:

1.  text_a (2208–2216):
    夏后啟曰以為可為故為之為之天下弗能
    *ləʔ ɢʷraj kʰˤajʔ ɢʷraj kˤaʔs ɢʷraj tə ɢʷraj tə
2.  text_b (3340–3348):
    不可弗爲以爲可 故爲之爲之繇其道物
    *ləʔ ɢʷraj kʰˤajʔ kˤaʔs ɢʷraj tə ɢʷraj tə pit

the numbers next to the identifiers are token indices, and may vary depending on how the text is tokenized – dphon currently uses character-based tokenization. whitespace will be removed, and the output will be aligned to make it easier to spot differences between the two sequences. by default, insertions are highlighted in green, and mismatches (differences between the two sequences) are highlighted in red. additional (non-matching) context added to either side of match sequences is displayed using a dimmed color (see "advanced usage" below for more information on colorization).

matches are sorted by the ratio of their phonemic similarity to their graphic similarity – in other words, matches between texts that sound highly similar but were written very differently will be at the top of the list.

by default, dphon only returns matches that display at least one instance of graphic variation – a case where two different graphemes are used in the same place to represent the same sound. these cases are highlighted in blue. if you're interested in all instances of reuse, regardless of graphic variation, you can use the --all flag:

$ dphon --all text_a.txt text_b.txt

you can view the full list of command options with:

$ dphon --help

this tool is under active development, and results may vary. to find the version you are running:

$ dphon --version

advanced usage

by default, dphon uses your system's $PAGER to display output, since the results can be quite long. on macOS and Linux, this will likely be less, which supports additional options like searching through the output once it's displayed. for more information, see the man page:

$ man less

dphon can colorize output for nicer display in the terminal if your pager supports it. to enable this behavior on macOS and Linux, set the LESS environment variable to include -R:

$ export LESS=-R

if you want to save the results of a run to a file, you can use redirection. this is useful when writing structured formats like .csv and .jsonl. you can also write HTML to preserve colors:

$ dphon -o html files/*.txt > results.html

alternatively, you can pipe the output of dphon to another utility like sed for filtering the results further. for example, you could strip out the ideographic space   from results to remove the alignments:

$ dphon files/*.txt | sed 's/ //g'

methodology

matching sequences are determined by a "dictionary" file that represents a particular reconstruction of Old Chinese phonology. this mapping performs grapheme-to-phoneme conversion, yielding the associated sound for each character:

"埃": "qˤə"
"哀": "ʔˤəj"
"藹": "qˤats"
...
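as a rough illustration of what this conversion amounts to (a sketch, not dphon's actual code; the file name sound_table.json is hypothetical):

import json

# load the character-to-phoneme mapping (file name is hypothetical)
with open("sound_table.json", encoding="utf-8") as f:
    sound_table = json.load(f)

def to_phonemes(text):
    """Look up the reconstructed sound for each character, if known."""
    return [sound_table.get(char) for char in text]

print(to_phonemes("埃哀"))  # ['qˤə', 'ʔˤəj']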

if two characters have the same phonemes, they're treated as a match. for characters with multiple readings, dphon currently chooses the first available reading for comparison. more work is planned for version 3.0 to address this shortcoming.

in version 1.0, dphon's default reconstruction was based on Schuessler 2007 [1], but used a single "dummy" character to represent all the lexemes in a rhyming group. the dictionary was compiled by John O'Leary (@valgrinderror) and Gian Duri Rominger (@GDRom). since version 2.0, dphon uses a dictionary based on the Baxter-Sagart 2014 reconstruction [2], with additional work by Rominger.

the matching algorithm is based on Paul Vierthaler's chinesetextreuse project [3], with some modifications. it uses a BLAST-like strategy to identify initial match candidates, and then extends them via phonetic edit distance comparison. finally, the results are aligned using a version of the Smith-Waterman algorithm that operates on phonemes, powered by the lingpy library [4].
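a rough sketch of the seed-and-extend idea (illustrative only, not dphon's implementation; difflib's ratio stands in here for a true phonetic edit distance):

from collections import defaultdict
from difflib import SequenceMatcher  # stand-in for a phonetic distance measure

def ngrams(seq, n=4):
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

def seeds(phonemes_a, phonemes_b, n=4):
    """Yield (i, j) positions where both texts share an identical phoneme n-gram."""
    index = defaultdict(list)
    for i, gram in enumerate(ngrams(phonemes_a, n)):
        index[gram].append(i)
    for j, gram in enumerate(ngrams(phonemes_b, n)):
        for i in index.get(gram, []):
            yield i, j

def extend(phonemes_a, phonemes_b, i, j, n=4, threshold=0.75):
    """Grow a seed to the right while the two windows remain similar."""
    length = n
    while i + length < len(phonemes_a) and j + length < len(phonemes_b):
        window_a = phonemes_a[i:i + length + 1]
        window_b = phonemes_b[j:j + length + 1]
        if SequenceMatcher(None, window_a, window_b).ratio() < threshold:
            break
        length += 1
    return (i, i + length), (j, j + length)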

development setup

first, clone the repository:

$ git clone https://github.com/direct-phonology/dphon.git
$ cd dphon

then, to create and activate a virtual environment (recommended):

$ python -m venv venv
$ source venv/bin/activate

install dependencies:

$ pip install -r dev-requirements.txt

finally, install the package itself in development mode:

$ pip install -e .

now your changes will be automatically picked up when you run dphon.

pull requests can be made against main.

code documentation

code documentation is available on GitHub Pages and is generated with pdoc3.

to build the docs:

$ pdoc --html --output-dir docs dphon

tests

unit tests are written with unittest. you can run them with:

$ python -m unittest

releases

the package is built and published to PyPI automatically using twine when a release is created via GitHub's release functionality.

make sure the version number in dphon/__init__.py is correct!


[1] Schuessler, Axel (2007), _ABC Etymological Dictionary of Old Chinese_, Honolulu: University of Hawaii Press, ISBN 978-0-8248-2975-9.

[2] Baxter, William H.; Sagart, Laurent (2014), _Old Chinese: A New Reconstruction_, Oxford University Press, ISBN 978-0-19-994537-5.

[3] Vierthaler, Paul, and Mees Gelein. “A BLAST-Based, Language-Agnostic Text Reuse Algorithm with a MARKUS Implementation and Sequence Alignment Optimized for Large Chinese Corpora,” April 26, 2019. https://doi.org/10.31235/osf.io/7xpqe.

[4] List, Johann-Mattis; Greenhill, Simon; Tresoldi, Tiago; Forkel, Robert (2019): LingPy. A Python library for historical linguistics. Version 2.6.5. URL: http://lingpy.org, DOI: https://zenodo.org/badge/latestdoi/5137/lingpy/lingpy. With contributions by Christoph Rzymski, Gereon Kaiping, Steven Moran, Peter Bouda, Johannes Dellert, Taraka Rama, Frank Nagel. Jena: Max Planck Institute for the Science of Human History.


dphon's Issues

add auto-generated code documentation

we could look at Sphinx for this, but something that generates Markdown would be ideal. hosting on GitHub Pages would also be preferred.

pdoc3 seems to be perfect, based on initial tests. we can generate and commit the HTML on pushes/PRs to main, and add it to the docs/ folder to be automatically picked up by GitHub Pages.

documenting things using NumPy-style docstrings seems like a good convention, but you can also just use pure Markdown with pdoc.

discard identical matches automatically

Sturgeon 2018 has a useful working definition of parallel passages:

For the purposes of this study, ‘parallel passages’ are defined as sections of text which contain significant common subsequences of characters having a high degree of similarity, and in addition appear likely to be causally related in terms of word choice by something more than the writers’ shared linguistic competence of the language.

He then elaborates in a footnote:

Though this characterization is far from an adequate formal definition, it highlights two key features in the parallels concerned here: they must be formally similar in some way, and they must also be non-trivial. Thus ‘歙歙訿訿’ and ‘潝潝訾訾’ are parallels, while ‘此之謂也’ and ‘此之謂也’ are not, despite the former pair consisting entirely of different characters and the latter pair being entirely identical, since the occurrence of the former independently in two texts given only our knowledge of the language itself would be highly improbable, whereas the latter would not.

This conception is useful; it allows us to focus on presenting information that is "non-trivial" - that is, information that requires the use of DIRECT or similar computation to discover at scale. If we accept that identical matches are "trivial", there's no reason to expend compute power on them.

matches are shorter than expected

there appears to be a bug in the match-lengthening logic. this is visible if you run some of the fixture files against each other, particularly:

dphon tests/fixtures/道德經/老子.txt tests/fixtures/郭店/老子乙.txt

we see in that output:

靜勝熱。清靜為天 (老子: 45)
清勝熱,清靜為天 (老子乙: 7)

this is interesting, given that 老子 has:

躁勝寒靜勝熱。清靜為天下正

while 老子乙 has:

燥勝凔,清勝熱,清靜為天下正

clearly 下正 at a minimum should be included in the match (leaving aside the potential graphic variants at the start, which might be included in a "fuzzy match").

it seems like there may be a logic bug in the match extension function - perhaps we are making the wrong assumption about ordering of matches.

use optimized translate/codec functions

Python provides str.translate() as part of the stdlib, along with its helper str.maketrans(), which appear to be explicitly designed for performing 1:1 character lookup/replacement ("translation"). this fits our use case of transforming input based on a dictionary.
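a minimal sketch of that approach (the two-entry mapping is just an excerpt of the dictionary format shown above):

# build a translation table from a character-to-phoneme mapping
sound_table = {"埃": "qˤə", "哀": "ʔˤəj"}  # excerpt only
table = str.maketrans(sound_table)

print("埃哀".translate(table))  # "qˤəʔˤəj"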

looking at some forum discussion of str.translate() performance, it's evident that the method had some performance issues in the past, but it's not immediately clear how it would perform with a potentially large, non-ASCII table.

it also appears, from this discussion of codecs.charmap_build(), that the codecs module might offer better performance, since it uses faster C representations of the data, although charmap_build() itself is undocumented. maybe we could create a text transform codec like rot13 and benchmark its performance.

explore different segmentation methods

it's an open question whether this is worthwhile or possible for Old Chinese, but there are certainly existing options outside of pure single-character segmentation. spaCy offers both jieba and pkuseg for its Chinese model.

we could explore retraining some general-purpose models, or work on assembling a gold corpus of segmented (and tagged?) text - again, if such a thing is possible. pkuseg's training API looks simple to use, at least.

Specify sources of the data

Maybe this is there already, but I could not find it when searching your code: where can I find the sources of the data? And where do you describe how you render them? I mean, seeing a file called "Schuessler", I know what to expect, but where exactly does this come from, and which version of Schuessler do you render?

correctly segment based on unicode grapheme clusters

background: the UAX #29 (Unicode Text Segmentation) report

it's clear based on this discussion on str methods that at least some of them (all that attempt to count graphemes) are not designed to handle grapheme clusters. since it's likely we could be handling characters that span multiple codepoints, we will probably run into this issue eventually and end up iterating halfway "through" a character when reading a text.

grapheme cluster segmentation isn't provided by the Python stdlib, and there are a variety of competing packages:

  • PyICU is an implementation of the full International Components for Unicode, providing a huge range of functionality. it is a wrapper for some C++ libraries and requires them to be locally installed in order to run.
  • pytextseg is listed as "beta" on PyPI and had its last release in 2012. it uses its own custom string implementation and also requires building locally prior to running.
  • uniseg is also listed as "beta" on PyPI and had its last release in 2015. it uses Python's internal str.
  • grapheme is listed as "alpha" on PyPI and had its last release in March of this year.

it might be worth benchmarking some of these against each other to make a choice, since apparently their implementations differ.
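for example, a quick check with the grapheme package (a sketch; assumes `pip install grapheme`):

import grapheme  # third-party package listed above

s = "e\u0301"  # "é" written as "e" plus a combining acute accent
print(len(s))                        # 2 codepoints
print(grapheme.length(s))            # 1 user-perceived character
print(list(grapheme.graphemes(s)))   # ['é']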

support filtering results to only those with graphic variation

this is trickier than it sounds. the most accurate implementation might do something like the following (a rough sketch follows below):

  1. align the match (#111)
  2. linearly probe both parts of the match until the two characters mismatch
  3. check if the mismatched characters have the same phonetic value; if not, continue
  4. if they do, return True immediately
  5. if we reach the end of the match return False

there might be a way to handle this during the extension phase, but it's not obvious how to ensure the variant actually connects two aligned characters.
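a rough sketch of those steps (illustrative; assumes the match has already been aligned so both sequences have equal length, and that phonemes maps a character to its reconstructed sound):

def has_graphic_variation(seq_a, seq_b, phonemes):
    """Return True if any aligned position uses two different graphs for the same sound."""
    for char_a, char_b in zip(seq_a, seq_b):
        if char_a == char_b:
            continue  # step 2: probe until the characters mismatch
        # steps 3-4: a mismatch whose phonetic values agree is a graphic variant
        if phonemes.get(char_a) is not None and phonemes.get(char_a) == phonemes.get(char_b):
            return True
    return False  # step 5: reached the end without finding a variant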

add basic documentation to README

some things we should probably add sooner rather than later:

  • badges (Build Status) - nice to have
  • setup instructions for local development
  • instructions for how to run tests

there might be an easy way to add the command's "help screen" (the output of dphon --help) to the README automatically too.

add structured input format

companion to #39; may accept some of the same formats (e.g. jsonl). ideally this will break large texts up into subunits, packaged along with metadata like the name of the text and section. format should be designed to make it easier to process many texts at once. we can take inspiration from passim's format.

support comparing a single text against itself

we've gone back and forth over whether this is useful, but it seems like something worth supporting. one case I can imagine is comparing poems from the 诗经 against themselves, to check whether or not the rhymes hold up under various reconstructions. you could also use it to compare multiple texts via a command like:

$ cat text1.txt text2.txt | dphon -

which would treat them all as a single piece of input, although it's not clear what this would mean in terms of line numbers.

move sound dictionaries to new repository

the new repository can serve to version the dictionaries separately from the code, as well as publish the dictionaries as tables usable with str.translate() (see #10).

update README

  • ensure badges are up-to-date
  • update section on release workflow
  • update usage section

score results for relevance

running against a large corpus, especially with some settings, can result in a huge volume of results. many of them are "low-quality" in that the matching portion consists of superficially similar elements that don't carry much semantic weight.

adjusting the match length can help, but there might be other heuristics we can use to improve relevance. one possibility is TF-IDF.
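one rough sketch of how TF-IDF could be applied here (purely illustrative, not something dphon currently does):

import math
from collections import Counter

def idf_weights(docs):
    """Inverse document frequency for each character across the corpus."""
    doc_freq = Counter(char for doc in docs for char in set(doc))
    total = len(docs)
    return {char: math.log(total / freq) for char, freq in doc_freq.items()}

def relevance(match_text, idf):
    """Score a match by summed IDF: frequent, low-content characters add little."""
    return sum(idf.get(char, 0.0) for char in match_text)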

parallelize algorithms

several stages of the data pipeline can be easily parallelized, especially the extension and alignment steps, which take the most time. we should look at Python's built-in multiprocessing module for this.

also worth a look is this spaCy example using joblib.
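a minimal sketch of how the alignment stage might be farmed out with multiprocessing (illustrative; align_match stands in for the real per-match work):

from multiprocessing import Pool

def align_match(match):
    # placeholder for the expensive per-match alignment work
    return match

def align_all(matches, processes=4):
    """Align matches in parallel; results come back in input order."""
    with Pool(processes=processes) as pool:
        return pool.map(align_match, matches, chunksize=64)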

improve unittest error/diff output

there are a few problems with the way AssertionErrors are output when running tests. we currently have output like:

Traceback (most recent call last):
  File "[...]/test_aligner.py", line 60, in test_spacing
    '由也千乘之國可使治其賦也           求也千室之邑百乘之家'
AssertionError: '由也千乘[33 chars]u3000求也\u3000\u3000\u3000\u3000\u3000\u3000千室之[92 chars]賓客言也' != '由也千乘[33 chars]u3000\u3000\u3000\u3000\u3000\u3000\u3000求也千室之[92 chars]賓客言也'
- 由也千乘之國可使治其賦也     求也      千室之邑百乘之家可使為之宰      赤也      束帶立於朝可使與賓客言也
?                  --                         --
+ 由也千乘之國可使治其賦也           求也千室之邑百乘之家可使為之宰            赤也束帶立於朝可使與賓客言也
?                        ++                         ++

it would be good to address:

  • truncation of the error message ([33 chars])
  • some unicode (full-width spaces) not rendering in error message (\u3000)
  • diffs not using full-width spacing, so symbols like -- and ++ don't line up

we may want to skip rendering the error message entirely in favor of the diff.

matching sequence lengths differ

taking as an example the same command that reproduced bug #23:

dphon tests/fixtures/道德經/老子.txt tests/fixtures/郭店/老子乙.txt

we see in that output:

中士聞道,若存若亡;下士聞道 (老子: 41)
中士聞道 (老子乙: 5)

this obviously doesn't make sense, as the sequence lengths should always be identical (ignoring punctuation, etc).

enhance match display in default output

it's difficult to interpret results without viewing the match in at least some document context. we certainly don't need the entire document, but a few characters on either side would probably help. we can also highlight the actual match portion in color and/or bold the portions that are graphic variants.

  • add optional context on either side of match (see comparable grep options -B and -A)
  • add optional highlights for insertions, substitutions, and variants using text/background color, bold, italics, etc. (--color)
  • replace newlines in output with ⏎ or a similar indicator
  • allow displaying the actual phonemes along with the text

it might be worth writing a match.pretty() method or similar to encapsulate this functionality. the best way to do it is probably by removing this concern from Match entirely, and writing a Formatter class that handles the display of matches.

note that writing to a file should disable --color, since the output will otherwise include a lot of unwanted control characters.

align matches for correctness and readability

using a matcher based on edit distance ratios results in matches that may include irrelevant material at the end; alignment is necessary to remove this material from the match.

additionally, alignment allows for padding gaps with spaces for easier comparison of the results, and can aid in colorizing the results to highlight differences (#54). it could play into the structured output (#39).

add structured output format

in addition to the default method of just printing things in plaintext, we need a way to save the output data for consumption by other programs. the simplest might be jsonl; we could also look at writing to parquet with pyarrow.

support reading from stdin

this is probably a low-priority enhancement, since few people would be using it on the command line anyway, but it could play into calling dphon externally via WSGI.

support reading multiple files

see passim's note about Spark conventions for more info on how this could work. it's not clear whether it will be possible to read both plaintext files and structured input (#42) at the same time.

missing characters are treated as punctuation

some texts use glyphs like □ (WHITE SQUARE, U+25A1) to indicate locations where characters are missing, obscured or damaged. since glyphs like this are technically not alphabetic in the general sense, they are being treated as punctuation currently, which means matches can extend "through" them. using two of the sample fixtures:

$ dphon tests/fixtures/郭店/老子丙.txt tests/fixtures/郭店/老子乙.txt

shows:

天下。故 (老子丙: 3)
天下□□□□□□□□□□□家 (老子乙: 8)

presumably we want to handle □ (and maybe other glyphs) as a special case when forming initial matches.
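a minimal sketch of that special case (illustrative; the set of placeholder glyphs would need to be curated):

MISSING_GLYPHS = {"\u25a1"}  # WHITE SQUARE; other placeholder glyphs could be added

def is_seedable(ngram):
    """Reject candidate seeds that span missing or damaged characters."""
    return not any(char in MISSING_GLYPHS for char in ngram)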

raise code coverage above 90%

depends on #41

ideally we would have the main unittest workflow post coverage reports as a status notification (or at least save them for download), and then show it off as a badge on the README.

allow specifying a minimum match length

for a minimum length, this is already implemented in the code but not in the CLI. for exact length we'll need to do something else - probably just set the minimum, run, and then don't group/extend the matches.

switch to poetry for dependency management

this will involve creating a pyproject.toml file, which seems like generally a good thing to have. should be useful to evaluate how this can simplify project setup/development.

group matches after extension

this plays into #54, since the groups are a little easier to read/parse.

unlike some of the operations for #139, this grouping doesn't mutate edges. in fact, its output might be another kind of complete graph, or it might be a simple list of lists or other non-graph-related structure. apparently this shape is called a star by networkx; see e.g. add_star()

one way this could work using the data= param for networkx's edges():

  1. query for all the edges that connect to the given node using
g.edges([n, None], data=True)

where g is a MultiGraph in which nodes are docs and edges are matches and n is the target doc. this will return all the data associated with the edge, and it will helpfully express the edges with the target node first:

MultiEdgeDataView([(n, other, {"foo": "bar"}), (n, other, {"foo": "bar"}), ...])
  2. group the edges via the sequence bounds in n. in other words, if there are two matches whose Span in n has the same start and end, group them. even if the actual aligned text differs due to spacing, we want to aggregate them all so that we can display the corresponding sequences that aren't in n together.
  3. when we display the group, we'll just use the actual unaligned text of n between both bounds of the Span (or group) once, followed by all the sequences from docs that aren't n.
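a rough sketch of that grouping (illustrative; assumes each edge stores the match's token bounds in the target doc under hypothetical start and end keys):

from collections import defaultdict
import networkx as nx

def group_matches(g: nx.MultiGraph, n):
    """Group every match touching doc n by its span bounds within n."""
    groups = defaultdict(list)
    for u, v, data in g.edges(n, data=True):
        other = v if u == n else u
        # "start" and "end" are hypothetical edge attributes holding the
        # match's token bounds in doc n
        groups[(data["start"], data["end"])].append((other, data))
    return groups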

benchmark using difflib's SequenceMatcher

see the docs on Differ, which offer:

The basic Ratcliff-Obershelp algorithm is cubic time in the worst case and quadratic time in the expected case. SequenceMatcher is quadratic time for the worst case and has expected-case behavior dependent in a complicated way on how many elements the sequences have in common; best case time is linear.

if performance for expected inputs (i.e. the kanripo corpus) is asymptotically linear, it might be worth swapping out the index-based method to instead diff the two docs' sets of ngrams, as in text-matcher.
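a minimal benchmark sketch (illustrative; the texts and repeat counts are arbitrary):

from difflib import SequenceMatcher
from timeit import timeit

def matching_blocks(a, b):
    """Find matching runs between two texts with SequenceMatcher."""
    return SequenceMatcher(None, a, b, autojunk=False).get_matching_blocks()

a = "躁勝寒靜勝熱清靜為天下正" * 100
b = "燥勝凔清勝熱清靜為天下正" * 100
print(timeit(lambda: matching_blocks(a, b), number=10))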

update extension strategy to match BLAST

the actual implementation of BLAST extends in both directions, which would help us because it would allow discarding matches immediately after the seeding step (instead of needing to extend and align prior to discarding in order to make sure the match meets the criteria for being discarded).

BLAST also uses a Poisson method to do what we called "condensing" high-similarity regions; BLAST2 used the simpler sum-of-scores method. In either case it may be worth revisiting this step to see if it affords performance gains.

see Vesanto 2019 for a broad overview of using BLAST for text reuse.

deposit on zenodo

we can turn on Zenodo's GitHub auto-deposit integration so that each release gets its own DOI.
