
teiphy's Introduction

Welcome!

I'm interested in projects at the intersection of (digital) humanities and computer science. The focus of my PhD research is the adaptation and application of phylogenetic methods for textual criticism, specifically in the tradition of the Greek New Testament. You'll likely find new tools here for preparing transcription and collation data (especially in TEI XML format) for phylogenetic and other analyses as I continue to develop them according to my and others' needs. Occasionally, I'll also host textual transcriptions and data for digital editions.

I've also produced my own implementation of the Coherence-Based Genealogical Method (CBGM), an approach to textual criticism developed for open traditions like that of the New Testament. The core library (https://github.com/jjmccollum/open-cbgm) and the command-line interface (https://github.com/jjmccollum/open-cbgm-standalone) are available here, and they are being incorporated into more user-friendly software by other developers on GitHub; I'll keep you posted as more updates become available!

teiphy's People

Contributors

chapnitsky, danielskatz, dependabot[bot], jjmccollum, rbturnbull


teiphy's Issues

Support StatesFormat=Frequency for NEXUS

For situations where an ambiguous reading is more likely to be resolved as some readings than as others (e.g., with retroversions and lacunae whose available space makes certain reconstructions more likely than others), we could take advantage of the StatesFormat=Frequency option in NEXUS, which allows us to assign frequencies for different states at each site. Maddison, Swofford, and Maddison's 1997 paper introducing NEXUS illustrates this in the following example matrix:

BEGIN CHARACTERS;
    DIMENSIONS NCHAR=3 ;
    FORMAT
        STATESFORMAT=FREQUENCY
        SYMBOLS="0 1 2";
    MATRIX
        taxon_1 (0:0.25 1:0.75) (0:0.3 1:0.7) (0:0.5 1:0.3 2:0.2)
        taxon_2 (0:0.4 1:0.6) (0:0.8 1:0.2) (1:0.15 2:0.85)
        taxon_3 (0:0.0 1:1.0) (0:0.55 1:0.45) (0:0.1 1:0.9) ;
END;

The trickier task is supporting this on the TEI XML side. At present, the best way I can think of to do this is to use certainty elements within witDetail elements (per §21.1.2 of the TEI Guidelines, https://www.tei-c.org/release/doc/tei-p5-doc/en/html/CE.html#CECECE) as follows:

<app xml:id="B10K2V15U24-26">
    <lem><w>εν</w><w>εαυτω</w></lem>
    <rdg n="1" wit="syrh 01C2 06 012 018 020 044 049 056 075 0142 0151 0319 0320 1 6 18 35 81 88 93 94 102 177 181 203 256 296 322 337 363 365 383 398 424 436 442 462 506 606 629 636C 664 665 915 1069 1108 1115 1127 1240 1241 1245 1319 1490 1505 1509 1573 1611 1617 1678 1721 1729 1751 1831 1836 1838 1840 1851 1860 1877 1886 1893 1910/3 1912 1918 1939 1963 1987 1991C 1996 1999 2008 2011 2012 2127 2138 2180 2243 2344 2352 2464 2492 2495 2544 2576 2805 2865 L23 L156 L169 L587 L809 L1159 L1178 L1188 L1298 L1440 L2010 AthanasiusOfAlexandria Epiphanius Speculum RP TR"><w>εν</w><w>εαυτω</w></rdg>
    <rdg n="2" wit="P46 01* 02 03 010 025 0150 33 38 61 69 104 218 263 326 330 451 459 467 1175 1311 1398 1718 1739 1837 1881 1908 1910/1 1910/2 1959 1962 1985 1991* 2004 2400 2516 2523 L60 L2058 SBL TH Tisch Treg WH"><w>εν</w><w>αυτω</w></rdg>
    <rdg n="3" wit="636*"/>
    <witDetail type="ambiguous" target="1 2" cause="commentary" wit="GregoryOfNyssa">
        <certainty target="1" locus="value" degree="0.6667"/>
        <certainty target="2" locus="value" degree="0.3333"/>
    </witDetail>
</app>

(As always, the pointers in the target attributes should technically refer to xml:id values on the rdg elements, but for convenience, this simpler localized notation can also be supported; users can use whichever notation they prefer, as long as they apply it consistently.)
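One way the certainty degrees could be mapped to a frequency vector for NEXUS output is sketched below, using the standard library. This is an illustration under assumed parsing conventions, not teiphy's actual implementation; the element is copied from the example above.

```python
# Sketch (not teiphy's implementation): extract state frequencies from the
# <certainty> children of a <witDetail> element for NEXUS Frequency output.
import xml.etree.ElementTree as ET

xml = """
<witDetail type="ambiguous" target="1 2" cause="commentary" wit="GregoryOfNyssa">
    <certainty target="1" locus="value" degree="0.6667"/>
    <certainty target="2" locus="value" degree="0.3333"/>
</witDetail>
"""

wit_detail = ET.fromstring(xml)
# Map each targeted reading to its degree; readings not covered by a
# certainty element would implicitly get a frequency of 0.
frequencies = {
    c.get("target"): float(c.get("degree"))
    for c in wit_detail.findall("certainty")
}
print(frequencies)  # {'1': 0.6667, '2': 0.3333}
```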

start writing JOSS paper

We can copy the template paper.md from an example in the JOSS docs. We can also set up a BibTeX file.

Add --version argument to app

Per the first round of reviewer comments on the paper (openjournals/joss-reviews#4879), it would be convenient to support a --version argument for the command-line interface that returns the current version of the code.

This should be straightforward, but in the interest of preventing oversight in future updates, it would be nice if we only had to update the version number of the project in one place (specifically, in pyproject.toml). @rbturnbull, do you know if there's a preferred way of reading a package's version number within Python for this purpose? I'm seeing several suggestions in the thread at python-poetry/poetry#273, so I wanted to see how you approach this problem.

I've opened a code-revision branch where we can make this change.

Should teiphy convert all variation units, including trivial ones?

Since we decided in #56 that ascertainment bias correction should be handled by the user's phylogenetic software rather than by teiphy, it is worth asking whether teiphy should convert all variation units to sites in the output file, and not just those with more than one substantive variant reading (after collapsing readings of ignored types with their "parent" readings), in case the trivial units are needed for ascertainment bias correction.

If we decide this is necessary, then this change should be tested in a dedicated branch, since whether it is worth implementing will depend on whether it causes any of the software tested in the GitHub workflows to throw errors. (I know that some programs will at least issue warnings about constant sites.)

think about name?

It would be good for the executable to have a shorter name than tei-collation-converter. Perhaps we could use teiphy or teinex; both are available on PyPI.

refactor into separate files

Hi @jjmccollum - the teiphy.py file is quite big. I think it would be good to split it up into separate files, with one file per class: Collation can go in collation.py and Witness in witness.py.

I'd do the same with the tests and have a test_witness.py and a test_collation.py etc.

What do you think?

Example in docs

The last thing I think we need in the docs is a bit more info about how to use teiphy with different phylogenetics packages. I thought we could give examples from the Ephesians test XML file. I've written up stuff, taking info from the paper and the CI/CD. I'll commit it to a new branch in a moment.

Update documentation

Per the reviewer comments at openjournals/joss-reviews#4879, we should add documentation on how to run the unit tests. From the GitHub repository, this is done via the command

poetry run pytest

@rbturnbull, is this all we need to add to the documentation, or is there some other way that we can run the unit tests if we've installed teiphy via pip?

Add support for BEAST XML output

For use cases with BEAST (2) as the target phylogenetic software, conversion to NEXUS followed by a second conversion through BEAUti is presently supported. Direct conversion to a BEAST XML input file, however, would allow additional features to be mapped, most notably variation unit-specific substitution models and the additional parameters incorporated into those models.

Because of the extensive nature of BEAST XML files, the conversion process will involve starting with a template file and adding new elements for witnesses, including fields for their sequences and date calibrations, and root frequencies and substitution models for each variation unit.

This feature will probably take extra effort to implement, so the work should be undertaken on a dedicated branch.

Anticipate encoding of ambiguous readings using `witDetail`

Where one or more witnesses have a gap or a nonsense reading that could be disambiguated as more than one substantive reading, this situation should be encoded in a TEI-friendly way. The TEI Guidelines (https://www.tei-c.org/release/doc/tei-p5-doc/en/html/TC.html#TCAPWD) describe a witDetail element parallel to lem and rdg elements that would be suitable for this purpose: the element includes a wit attribute (for one or more witnesses described by its detail) and a target attribute (which can point to one or more readings that might disambiguate it). For example:

<app xml:id="B10K1V3U22-26">
    <lem><w>ο</w><w>ευλογησας</w><w>ημας</w></lem>
    <rdg xml:id="B10K1V3U22-26R1" wit="P46 01C1 02 03 06 010 012 018 020 025 044 056 075S 0142 0150 0151 0278 0319 1 6 18 33 35 38 61 69 81 88 93 102 104 177 181 203 218 263 296 322 326 330 337 363 365 383 398 424 436 442 451 459 462 467 506 606 629 636 665 915 1069 1108 1115 1127 1175 1240 1241 1245 1311 1319 1398 1505 1509 1573 1611 1617 1718 1729 1739 1751 1836 1837 1838 1851 1860 1877 1881 1886 1893 1908 1910 1912 1918 1939 1959C 1962 1963 1985 1987 1991 1996 1999 2004 2005 2008 2011 2012 2127 2138 2180 2243 2344 2352 2400 2464 2492 2495 2516 2523 2544 2576 2805 2865S L156 L169 L587 L809 L1159 L1178 L1188 L2058 syrh AthanasiusOfAlexandria CyrilOfJerusalem RP SBL TH TR Tisch Treg WH"><w>ο</w><w>ευλογησας</w><w>ημας</w></rdg>
    <rdg xml:id="B10K1V3U22-26R1-v1" type="reconstructed" wit="94"><w>ο</w><w>ευ<unclear>λ</unclear>ογησας</w><w>ημας</w></rdg>
    <rdg xml:id="B10K1V3U22-26R1-f1" type="defective" wit=""><w>ο</w><w>ευλογησης</w><w>ημας</w></rdg>
    <rdg xml:id="B10K1V3U22-26R1-f1-v1" type="reconstructed" wit="1959*"><w>ο</w><w>ευλογησ<unclear>η</unclear>ς</w><w>ημας</w></rdg>
    <rdg xml:id="B10K1V3U22-26R2" wit="664 1490 1678 1831 1840 L1440 L2010"><w>ο</w><w>ευλογησας</w><w>υμας</w></rdg>
    <rdg xml:id="B10K1V3U22-26R2-f1" type="defective" wit="1721"><w>ο</w><w>ευλογησας</w><w>υμιας</w></rdg>
    <rdg xml:id="B10K1V3U22-26R3" wit="01*"><w>ο</w><w>ευλογησας</w></rdg>
    <witDetail type="ambiguous" target="#B10K1V3U22-26R1 #B10K1V3U22-26R2" wit="256"><w>ο</w><w>ευλογη<gap unit="word" extent="part" reason="lacuna"/></w><w><gap unit="word" extent="part" reason="lacuna"/>μας</w></witDetail>
</app>

While the pointers in the wit and target attributes should technically point to unique elements (which, within the XML collation document, would be xml:id values prefixed by the # character), in practice, we may assume the pointers to refer to n values (for witnesses or readings within the same app element) if they do not start with the # prefix. (This is especially convenient for New Testament textual critics, who use Gregory-Aland numbers to refer to manuscripts; XML guidelines prohibit xml:ids that begin with numbers.) So the following should also be supported (even if it is not strictly valid TEI):

<app xml:id="B10K1V3U22-26">
    <lem><w>ο</w><w>ευλογησας</w><w>ημας</w></lem>
    <rdg n="1" wit="P46 01C1 02 03 06 010 012 018 020 025 044 056 075S 0142 0150 0151 0278 0319 1 6 18 33 35 38 61 69 81 88 93 102 104 177 181 203 218 263 296 322 326 330 337 363 365 383 398 424 436 442 451 459 462 467 506 606 629 636 665 915 1069 1108 1115 1127 1175 1240 1241 1245 1311 1319 1398 1505 1509 1573 1611 1617 1718 1729 1739 1751 1836 1837 1838 1851 1860 1877 1881 1886 1893 1908 1910 1912 1918 1939 1959C 1962 1963 1985 1987 1991 1996 1999 2004 2005 2008 2011 2012 2127 2138 2180 2243 2344 2352 2400 2464 2492 2495 2516 2523 2544 2576 2805 2865S L156 L169 L587 L809 L1159 L1178 L1188 L2058 syrh AthanasiusOfAlexandria CyrilOfJerusalem RP SBL TH TR Tisch Treg WH"><w>ο</w><w>ευλογησας</w><w>ημας</w></rdg>
    <rdg n="1-v1" type="reconstructed" wit="94"><w>ο</w><w>ευ<unclear>λ</unclear>ογησας</w><w>ημας</w></rdg>
    <rdg n="1-f1" type="defective" wit=""><w>ο</w><w>ευλογησης</w><w>ημας</w></rdg>
    <rdg n="1-f1-v1" type="reconstructed" wit="1959*"><w>ο</w><w>ευλογησ<unclear>η</unclear>ς</w><w>ημας</w></rdg>
    <rdg n="2" wit="664 1490 1678 1831 1840 L1440 L2010"><w>ο</w><w>ευλογησας</w><w>υμας</w></rdg>
    <rdg n="2-f1" type="defective" wit="1721"><w>ο</w><w>ευλογησας</w><w>υμιας</w></rdg>
    <rdg n="3" wit="01*"><w>ο</w><w>ευλογησας</w></rdg>
    <witDetail type="ambiguous" target="1 2" wit="256"><w>ο</w><w>ευλογη<gap unit="word" extent="part" reason="lacuna"/></w><w><gap unit="word" extent="part" reason="lacuna"/>μας</w></witDetail>
</app>
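The pointer normalization described above can be sketched in a few lines. The helper below is hypothetical, not teiphy's actual code; it simply treats a leading # as marking an xml:id reference and anything else as a bare n value.

```python
# Sketch: normalize a target pointer so that both strict TEI pointers
# ("#B10K1V3U22-26R1") and convenience n values ("1") resolve uniformly.
def resolve_target(pointer: str) -> str:
    """Strip the leading '#' from an xml:id reference; leave bare n values as-is."""
    return pointer[1:] if pointer.startswith("#") else pointer

targets = "#B10K1V3U22-26R1 #B10K1V3U22-26R2".split()
print([resolve_target(t) for t in targets])
# ['B10K1V3U22-26R1', 'B10K1V3U22-26R2']
```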

Support filling corrector text based on previous witness's text

Depending on the application, we may want to treat correctors (e.g., GA 424C) as fragmentary witnesses whose text is defined only where it appears and is "lacunose" otherwise, or we may want to treat them as fuller witnesses that assume their base witnesses' texts (or, for later correctors, the previous correctors' texts) where they do not explicitly have their own readings.

Ideally, this would be a command-line argument for the user to specify.
On the TEI XML end, this rule could be encoded with additional witness entries in the listWit with special types (e.g., "corrector"), placed after their respective parents, as follows:

<witness n="424"/>
<witness type="corrector" n="424C1"/>
<witness type="corrector" n="424C2"/>

Then, when we generate the collation matrix, we could optionally match the corrector witness's reading to the previous witness's reading whenever the corrector witness has no reading(s) of its own.
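A minimal sketch of that matching step, assuming a hypothetical list-based data model in which None marks units where the corrector has no reading of its own:

```python
# Sketch (hypothetical data model): fill a corrector's missing readings
# from the preceding witness in the listWit ordering.
def fill_corrector(base_readings, corrector_readings):
    """Return the corrector's readings, falling back to the base witness's
    text wherever the corrector has no reading of its own (None)."""
    return [
        c if c is not None else b
        for b, c in zip(base_readings, corrector_readings)
    ]

base = ["1", "2", "1", "3"]         # readings of 424 at four units
corrector = [None, "1", None, "2"]  # 424C1 only corrects units 2 and 4
print(fill_corrector(base, corrector))  # ['1', '1', '1', '2']
```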

Relative path to chron file for stemma output

Hi @jjmccollum - at the moment the stemma chron file is given as an absolute path in the stemma output. Is it possible for this to be a relative path from the stemma file so that if you move the directories (or post them online) the paths can still work?
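A sketch of the path computation with the standard library (the file paths here are hypothetical):

```python
# Sketch: express the chron file path relative to the directory containing
# the stemma output file, so the pair can be moved or published together.
import os

stemma_path = "/home/user/project/out/stemma.txt"  # hypothetical paths
chron_path = "/home/user/project/data/chron.txt"

rel = os.path.relpath(chron_path, start=os.path.dirname(stemma_path))
print(rel)  # '../data/chron.txt' on POSIX systems
```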

Failure with no user-oriented error message

Describe the bug
When I run teiphy on a sample document specifying the output file temp.nex, I get what appears to be a Python stack trace (highlighting lines main.py:111, collation.py:1113, and collation.py:325), followed by the message "ValueError: max() arg is an empty sequence".

If I specify temp.phy as the output, the stack trace hits main.py:111, collation.py:1126, and collation.py:625.

If I specify temp.fasta as the output, the stack trace hits main.py:111, collation.py:1129, and collation.py:695.

If I specify temp.csv or temp.tsv, I get no stack trace and a three-character output file consisting of two double quotes and a newline character.

To Reproduce
This happens whenever I try to run teiphy on my TEI document.

Expected behavior
My initial vague expectation (this is my first attempt to use teiphy) was that I would get output.

Failing that, I was hoping for a message to suggest what has gone wrong. Maybe my input is not as expected, maybe my setup is not as expected.

Error message
I'll attach a log file.

Environment

  • System: Linux (Pop!_OS 22.04 LTS - an Ubuntu offshoot)
  • Python Version: python3 --version returns "Python 3.10.6"
  • Code version or commit hash: Installed today (22 Dec 2022) using pip install teiphy, so I guess it should be the release of 18 Dec 2022.

Additional context
I am not a regular Python user, so I may easily have failed to do something that any rational Python user would know needs to be done. I have not used teiphy before, so I don't know that my input file satisfies the expectations of the software. I would consult the documentation to find out what those expectations are, but I do not know where to find it. (Aha. Clicking on the link on the repo splash page takes me to an online version of the documentation. I don't see any description there of expectations regarding the TEI input document.)

In other words, the error here is almost certainly mine, or something odd in my environment; I am reporting it as an issue only because I don't know where else to get help. (And my problems may suggest ways in which the software and documentation might be bullet-proofed against cheerful idiot users like me.)

I have placed the input file in input.zip, since Github seems not to like the idea of attaching an XML document to an issue.

input.zip
installation-20221222.log
run.log

Add support for NEXUS-style CSV/Excel output

Peter Montoro has requested support for a more compact version of the usual CSV/Excel output format that resembles NEXUS sequences. This would have the very nice benefit of easy filtering for sequences of specific readings at specific variation units. Implementation details follow:

  • This format should have a row for every witness, a column for every variation unit (keyed by ID), and cell values corresponding to a given witness's reading ID.
  • This output format should support the ambiguous_as_missing option, as space-separated reading IDs should be allowed in a CSV/Excel file.
  • The different types of CSV/Excel outputs should be grouped together under a TableType class that implements an Enum (similar to the ClockModel class) so that these options are mutually exclusive. Once this feature is implemented, the options should be something like matrix (witnesses x readings, numerical cell values; this is the default option), nexus (witnesses x variation units, reading ID cell values), and long (many rows consisting of witness, variation unit ID, reading index, and reading text entries).
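A sketch of what the proposed TableType class might look like. The member names come from the list above; the values and any other details are assumptions, modeled loosely on the existing ClockModel Enum.

```python
# Sketch of the proposed TableType options as a mutually exclusive Enum.
from enum import Enum

class TableType(Enum):
    matrix = "matrix"  # witnesses x readings, numerical values (default)
    nexus = "nexus"    # witnesses x variation units, reading ID values
    long = "long"      # one row per (witness, unit, reading) tuple

print(TableType("nexus").name)  # 'nexus'
```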

system tests in CI

I'll set up a GitHub Actions workflow to run the output of teiphy through iqtree. @jjmccollum - you can use that as an example for other programs like stemma and MrBayes.

Add support for PHYLIP and FASTA formats

Per the latest reviewer comments (openjournals/joss-reviews#4879), support for the PHYLIP and FASTA formats (used by programs like PHYLIP, RAxML, and IQ-TREE) would be convenient. Parsing and output methods for these formats should be added to the code.
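The FASTA serialization itself is only a few lines. The helper and the sequence symbols below are illustrative, not teiphy's actual method:

```python
# Sketch: serialize a witness-to-sequence mapping in FASTA form, with one
# state symbol per variation unit ('?' marking missing data).
def to_fasta(sequences):
    lines = []
    for witness, seq in sequences.items():
        lines.append(">%s" % witness)
        lines.append(seq)
    return "\n".join(lines) + "\n"

print(to_fasta({"P46": "0101?", "01": "0111?"}))
```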

These changes can be made on the code-revision branch.

Incorporate latest feedback on paper

Per the latest comments at openjournals/joss-reviews#4879:

  • The format for Carlson's STEMMA is described as "unique" (lines 13 and 74), which I assume means "exclusive to". The adjective seems superfluous, not only because the format is not "unique" but also because the audience of the paper will most likely know about the format.
  • There is no dispute that critical texts are fundamental to classical philology (line 35), but the authors use "digital humanities" in a stricter sense than the one adopted by most people, more as "the ongoing digitization of philological work". This is a common usage in the field, of which I am also guilty, but as has been pointed out to me in the past, it might be worth talking about a "digital turn in philology" or something similar, reserving DH for the sense of "the intersection of computing and the humanities in general".
  • In line 53, it is possible to say that no currently available phylogenetic software expects input in TEI XML, stressing the value of this contribution.
  • In line 54, it would be worth mentioning that the format was conceived not only for versatility but also for promoting digital adoption in stemmatology.
  • In line 64, PAUP* is not cited.
  • In line 85, from experience, there are cases where researchers might prefer or need a long table format, transposing the table (it is not practical for human inspection in most cases, but I would not dismiss it).
  • Installation instructions are given in the repository README file and in the documentation, but not in the draft. This is not an issue, given that the procedure is documented, easy, and intuitive, but explicit instructions are missing.

In addition, we have the following paper revisions related to the code revisions:

  • Update discussion and examples of the --states-present option, as we now have a --frequency option instead.

These changes should be made on the paper-revision branch.

Add support for `--ascertainment` option

Per the latest comments on openjournals/joss-reviews#4879:

The authors briefly mention how some phylogenetic software (like IQ-TREE) can automatically perform ascertainment correction (usually with the Lewis Mk model), but an --ascertainment flag would be very welcome for the NEXUS output format in general (or even for all formats). Not only does some software expect the user to perform the correction themselves by adding the states to the alignment, but in some cases we might prefer to have it in the model even when the software offers it as an option.

I suspect that in NEXUS format, options like this would be specified in the ASSUMPTIONS block. If this specification depends on the target software (like IQ-TREE, MrBayes, or BEAST), separate options may have to be developed for each program. I haven't yet found examples of NEXUS files that include these options.

These changes should be made on the code-revision branch.

Make `--states-present` default option for NEXUS output

Per the latest comments on openjournals/joss-reviews#4879, the most common expected use case for NEXUS outputs is the use of state symbols as opposed to state frequency vectors. For this reason, NEXUS outputs should be generated with the setting StatesFormat=StatesPresent by default. Instead of a --states-present command-line option, we should offer a --frequency option that sets StatesFormat=Frequency instead.

These changes should be made on the code-revision branch.

Add `--long-table` option for tabular output

Per the latest comments on openjournals/joss-reviews#4879, it would be convenient to add a --long-table option for generating tabular output formats (NumPy, Pandas DataFrame, CSV, TSV, Excel) not in state frequency form, but in the form of a long table of value tuples, as follows:

Taxon,Character,State,Value
UBS,B10K1V1U24-26,0,εν εφεσω
P46,B10K1V1U24-26,1,om.
01,B10K1V1U24-26,1,om.
02,B10K1V1U24-26,0,εν εφεσω
03,B10K1V1U24-26,1,om.
04,B10K1V1U24-26,?,
UBS,B10K1V6U22-26,0,εν τω ηγαπημενω
P46,B10K1V6U22-26,0,εν τω ηγαπημενω
01,B10K1V6U22-26,0,εν τω ηγαπημενω
02,B10K1V6U22-26,0,εν τω ηγαπημενω
03,B10K1V6U22-26,0,εν τω ηγαπημενω
04,B10K1V6U22-26,1,εν τω ηγαπημενω υιω αυτου
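Rows like these can be written out with the standard library's csv module. The sketch below just copies a few rows from the sample above; it is not how teiphy would generate them:

```python
# Sketch: write long-table rows to CSV with the standard library.
import csv

rows = [
    ("UBS", "B10K1V1U24-26", "0", "εν εφεσω"),
    ("P46", "B10K1V1U24-26", "1", "om."),
    ("04", "B10K1V1U24-26", "?", ""),
]
with open("long_table.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Taxon", "Character", "State", "Value"])
    writer.writerows(rows)
```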

This change should be made on the code-revision branch.

logo/banner

Hi @jjmccollum - it might be good to have a logo/banner. We can add a banner to the GitHub repo so that the logo shows up when people share it on social media. Here's a draft banner which incorporates the TEI logo (https://tei-c.org/about/logos/). That logo is licensed under CC and allows edits. The draft isn't pretty, but it might be good to tie it in with TEI.

teiphy-logo

Are you happy with this, or should we keep working on it?

Assignment of random starting rate parameter values, new prior distributions, and support for `--seed` option

For larger collations with increasingly varied substitution matrices at different sites/variation units, it becomes increasingly likely that at least one of the substitution matrices will be singular under the initial assignments of transcriptional rate parameters (which presently all default to 2.0). This ceases to be a problem after BEAST 2's first evaluation of the site likelihoods, because in subsequent likelihood calculations, it samples these rate parameter values randomly according to a prior distribution, and the resulting substitution matrices will almost never be singular. Since BEAST 2 XML inputs require that every RealParameter element have an initial @value attribute specified, we need to supply these values, but we should be able to avoid singular matrices if we set these values randomly in to_beast.

This can be done by using distributions available in numpy, which is already a dependency of teiphy. I suggest replacing the current offset log-normal prior distribution assumed for transcriptional rate parameters with a simple gamma distribution (perhaps with alpha="5" and beta="2" as default values). Then we can sample random initial values for the rate parameters from this distribution using something like

import numpy as np

rng = np.random.default_rng(seed)
sample = rng.gamma(5.0, 2.0)  # note: numpy's second argument is the scale parameter

To ensure replicability for generated outputs, we will need to add support for a --seed command-line argument.

adding help text for input and output in main.py

I think there should be some help text for the input and output arguments in the CLI. Also, is it OK if we call them just input and output - or perhaps input_path and output_path? I'm not sure it is common to abbreviate them as addr for address.

docs for command line flags

Hi @jjmccollum - in the docs you have shown examples with this convention for the CLI:

-t"reconstructed" -t"defective" -t"orthographic" -t"subreading"
I think it might be more common to use this convention with spaces between the tag and the input.

-t reconstructed -t defective -t orthographic -t subreading
This way we don't need to have the quotes. Both work equally well but this might be clearer and more standard. What do you think?

Fix github actions workflow for iqtree

The iqtree pipeline has started failing because no iqtree executable is on the path after installing it with apt-get. This might be related to the version being installed (2.0.7+dfsg-1).

simplify CLI to one command

Hi @jjmccollum - there is quite a bit of duplication in main.py. I think we can have just one function that infers the output type from the extension of the output file, with an override available on the command line. I'll put something together and submit a PR to show you what I mean.

changing the name of the py directory

It is normal convention for the Python code to live in a directory with the same name as the package. It is also possible to call it 'src', but I think I'd prefer to follow the standard convention. Are you happy to rename 'py' to 'teiphy'?

Add export methods for other output formats

For use within Python, a to_numpy export method would be useful; this would allow us to convert a TEI XML collation into a ready-made collation matrix for input to machine learning packages (e.g., nimfa for non-negative matrix factorization).
It could also serve as a stepping stone for to_csv and to_xlsx methods via pandas.
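A sketch of the kind of matrix such a method might return; the witnesses, variation units, and reading indices here are illustrative, not output of teiphy:

```python
# Sketch: rows are witnesses, columns are variation units, and values are
# reading indices; '?' states would need a masking or frequency convention.
import numpy as np

witnesses = ["UBS", "P46", "01"]
variation_units = ["B10K1V1U24-26", "B10K1V6U22-26"]
matrix = np.array([
    [0, 0],
    [1, 0],
    [1, 0],
])
# A pandas DataFrame built from this matrix (with witnesses as the index
# and variation units as columns) would make to_csv/to_xlsx one-liners.
print(matrix.shape)  # (3, 2)
```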

As for STEMMA, I'll have to look more closely at the structure of the files in https://github.com/stemmatic/mss to see if TEI XML is rich enough to make the conversion straightforward; it may be more of a challenge.

CHARSTATELABELS block?

I think some phylogenetics programs need the CHARSTATELABELS block filled out (I thought BEAUti needed it, for example). Is it possible to include it? I've got an example here: https://github.com/rbturnbull/phylopaul/blob/main/1Corinthians/1Corinthians.nexus
I was lazy with that example and just called the readings/states State0 or State1. It might be good to slugify the readings and use them as the state labels. I also include the site/character label as a comment in square brackets. There should be a command-line option to omit the CHARSTATELABELS block if the user doesn't want it. What do you think?
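Inside the NEXUS CHARACTERS block, a CHARSTATELABELS command with slugified readings might look like the fragment below. The unit IDs and slugs are illustrative, loosely based on the Ephesians readings elsewhere on this page, not generated output:

```nexus
CHARSTATELABELS
    1 B10K1V1U24-26 / en_efesw omit,
    2 B10K1V6U22-26 / en_tw_hgaphmenw en_tw_hgaphmenw_uiw_autou;
```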

Revise paper

Per the latest reviewer comments (openjournals/joss-reviews#4879), we should expand the Use Case section to discuss the UBS dataset and the results depicted for it in Figure 1. We could include the results from other software in additional figures to highlight the consistency of several groupings; including the output of stemma would also allow us to talk briefly about how known contamination in the dataset is accounted for.

Aside from this, several minor revisions were recommended. I have reproduced the checklist here:

  • I would like more detail on the example, preferably with worked through command line steps. This is already in the example section of the docs, so please add this into the paper too (especially the step of generating a nexus file, showing and explaining the command line arguments).

  • Figure 1, and discussion: can you please describe what this tree shows us (e.g., do these groupings make sense, do they tell us something interesting about this document?)

  • L16: please define what the acronym "TEI XML" means.

  • L37 "since then, phylogenetic methods have quickly evolved (Felsenstein, 2004),". Please reword -- this reads as if phylogenetic methods were driven by textual analyses, whereas, these methods have been in continuous development since the 70s and 80s, and not for textual analyses.

  • L52 "great chasm .. fixed". Please edit - it's a bit over the top, and/or the verb "fixed" is awkward. Perhaps "exists" or "hinders the use of TEI XML in phylogenetic contexts".

  • L60: TNT's claims of 'remarkable performance' may have been true in 2008, but now there are plenty of genomic-scale methods that can outcompete TNT (something like RAxML-NG probably blows it out of the water, for example). A better argument for TNT is that it has a comprehensive set of morphological state models that may be appropriate for digital humanities data. You later go on to use tools like IQ-Tree and MrBayes, so it'd be worth broadening this section to mention these tools up front too. Don't let people think your program is innately tied to "cladistic" tools and approaches (and I mean "cladistic" here in the pejorative sense).

  • L72: Might be worth noting that this provides an immediate pathway into all the machine learning tools available in python (e.g. scikit-learn, tensorflow etc).

  • L75-76: Can you briefly explain what 'lacunae' and 'retroversions' are (this also strengthens the case for your program)? Another term, 'corrector', is used later in the m.s. without explanation as well.

  • L79: what do you mean by 'soft decision'?
