Define rules for a stable taxonID. Understanding when a taxon changes sufficiently to

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Comments (129)

mjy commented on May 29, 2024

"Fundementally doomed indeed. :("

Types frantically for 20 minutes, then deletes everything.

from general.

mjy commented on May 29, 2024

@rdmpage Right, point taken, larger classification not needed. I think, given this fact, there is nothing preventing the experiment from starting right now:

1. Grab a GSD for each of a small, medium, and large clade. Repeat 10 times, drawing data from the 10 years of GSD submissions (these are variously archived). @gdower can likely help get the datasets if someone wants to try.
2. Run the experiment. Visualize the results.
3. Tweak the metric, optimize for maximum stability, repeat.

I.e. there are no bottlenecks beyond time to this experiment.
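As a rough illustration of what "optimize for maximum stability" could measure, here is a minimal sketch. Everything in it is an assumption made for illustration: the function `id_stability`, the dict layout of a taxon record, the two candidate ID functions, and the toy data. It scores a candidate ID rule by the mean fraction of identifiers that survive from one release to the next.

```python
from typing import Callable, Hashable, Iterable, Sequence


def id_stability(
    releases: Sequence[Iterable[dict]],
    id_fn: Callable[[dict], Hashable],
) -> float:
    """Mean fraction of IDs in each release that survive into the next one."""
    id_sets = [{id_fn(taxon) for taxon in release} for release in releases]
    scores = [
        len(prev & curr) / len(prev)
        for prev, curr in zip(id_sets, id_sets[1:])
        if prev  # skip empty releases to avoid dividing by zero
    ]
    return sum(scores) / len(scores) if scores else 1.0


# Two candidate metrics (both hypothetical): contents only vs. contents plus parent.
def by_contents(taxon: dict) -> Hashable:
    return frozenset(taxon["species"])


def by_contents_and_parent(taxon: dict) -> Hashable:
    return (frozenset(taxon["species"]), taxon["family"])


# Toy data: the genus keeps its species but moves to another family.
releases = [
    [{"name": "Aus", "family": "Aiidae", "species": ["bus", "cus"]}],
    [{"name": "Aus", "family": "Xiidae", "species": ["bus", "cus"]}],
]

stable = id_stability(releases, by_contents)
unstable = id_stability(releases, by_contents_and_parent)
```

With real GSD snapshots, the same loop could be run once per candidate rule and the scores plotted per clade size.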


deepreef commented on May 29, 2024

To define what is meant by a "taxon" instance (to which a taxonID is assigned), we need to establish the "core properties" of an instance of a "taxon", such that if one of the core properties changes, a new taxonID must be issued. I think it's best to narrow the scope of those properties to representing the "contents" of a taxon, rather than the combination of contents and "context". "Contents" in this sense are the items contained within the circumscription of a taxon. For example, a taxon representing a genus would be defined by the set of species contained within it. Suppose two different assertions of a genus contain different sets of species: Aus sensu Smith contains (Aus bus+Aus cus), with species "dus" placed in genus "Xus", whereas Aus sensu Jones contains (Aus bus+Aus cus+Aus dus). Then Aus sensu Smith would have a different taxonID from Aus sensu Jones, because they have different contents. "Context" in this sense means placement within a hierarchical classification. Changing the context of a taxon instance should not cause a change in taxonID. For example, if Smith and Jones both assert the same contents of the genus Aus (e.g., A.bus+A.cus+A.dus), but Smith places the genus in the family Aiidae and Jones places the genus in the family Xiidae, we do not need a different taxonID to represent Aus sensu Smith and Aus sensu Jones. Logically, this means that for a species-level concept, if the circumscriptions of both Smith and Jones for the species "bus" are the same (i.e., same heterotypic synonymy), then they have the same taxonID even if Jones treats it as "Aus bus" and Smith treats it as "Xus bus". This needs to be fleshed out in a full document.
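A minimal sketch of the contents-versus-context idea, assuming (purely for illustration) that an identifier is derived by hashing the set of included child names. The function `content_taxon_id` and the hashing scheme are hypothetical, not something proposed in this thread. The point is only that the classification is not an input, so moving the genus between families cannot change the result.

```python
import hashlib
import json


def content_taxon_id(included_protonyms):
    """Hash the set of included child names; classification is not an input."""
    # Sort and de-duplicate so listing order does not affect the ID.
    canonical = json.dumps(sorted(set(included_protonyms)))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Children are keyed by epithet, not by combination, so "Aus bus" and
# "Xus bus" both count as "bus".
aus_sensu_smith = content_taxon_id(["bus", "cus"])         # "dus" placed in Xus
aus_sensu_jones = content_taxon_id(["bus", "cus", "dus"])

# Same contents asserted under different families (Aiidae vs. Xiidae):
aus_in_aiidae = content_taxon_id(["bus", "cus", "dus"])
aus_in_xiidae = content_taxon_id(["dus", "cus", "bus"])
```

Here the Smith and Jones genera get different identifiers because their contents differ, while the two family placements share one.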


mdoering commented on May 29, 2024

I agree with excluding the position in the classification, the "context", from the taxon concept identity. As for the included species, we should find a way to allow newly described species to be added to a genus without changing its identity, as long as the species were not moved from another genus. See also

As for a final output I agree this needs to be written up somewhere else, probably as part of the API documentation. But in general I would like to get an agreement on the key points in these issues first instead of creating lots of documents that tend to consume a lot of overhead just for styling and explaining the context.


deepreef commented on May 29, 2024

Yes, exactly! Dave and I discussed this at some length at Woods Hole. What it boils down to is this:
When a new species is described, what would its type specimen have been identified as prior to the new species description? If it would have been identified as an earlier-named species, then what we have is a case where one larger species was split into two smaller species. As such, the circumscription of the genus doesn't change. On the other hand, in the case of "brand new" species, which would not have had ANY taxonomic identity prior to their description, the concept for the genus would need to change. Obviously, this is subjective in many cases, and not obvious. But from an informatics perspective, I think the cleanest answer involves how the links between names and their corresponding name-bearing types are made. I suggested to Dave that we could have a "retroactive identification" system, whereby it can be asserted that the type specimen of a new species would have been identified as species "X" prior to the new species description. This can be proxied through assertions of heterotypic synonymy, if we don't want to get all the way down to identifications of specimens. I will take some time this weekend to come up with a better diagram than what is shown above. I actually started one in Woods Hole, so I will finish that and then share it.
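The "retroactive identification" rule could be sketched roughly as follows. The class `NewSpecies`, its `prior_identification` field, and the helper names are assumptions for illustration only. In practice the assertion might instead be proxied through heterotypic synonymy, as suggested above.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class NewSpecies:
    name: str
    # Hypothetical "retroactive identification": the species the type specimen
    # would have been identified as before this description, or None.
    prior_identification: Optional[str] = None


def kind(species: NewSpecies) -> str:
    """Classify a new species as a split of an older one or as brand new."""
    return "split" if species.prior_identification is not None else "brand new"


def genus_concept_changes(additions: Iterable[NewSpecies]) -> bool:
    """Under this rule only brand-new species alter the genus circumscription."""
    return any(kind(s) == "brand new" for s in additions)


dus = NewSpecies("Aus dus", prior_identification="Aus bus")  # carved out of Aus bus
eus = NewSpecies("Aus eus")                                  # no earlier identity
```

Adding only `dus` would leave the genus concept untouched under this rule, whereas adding `eus` would change it.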


deepreef commented on May 29, 2024

By the way, if we can address this informatically, then we will have also created a REALLY valuable tool to distinguish "true" new species from "split" new species. An analogy of the difference is the two kinds of "Gaps" in CoL (actual taxonomic gaps vs. synonym name gaps). This is important because it distinguishes cases of new species that increase our understanding of the scope of biodiversity from cases of drawing new lines within our already-existing understanding of the scope of biodiversity. We've never been able to do this before.


mdoering commented on May 29, 2024

Would we want a genus concept to change when a brand new species gets added? It would mean genus identifiers change quite a lot over time and we lose stability. It might be more useful to restrict changes of genus concepts to true splits & merges of genera, ignoring the exact number of included species for the most part and focusing on the genus types, as we discussed at some point. This needs real-world examples to test.


deepreef commented on May 29, 2024

It depends on how much you want to reflect reality, and it also depends on what you mean by stability. The unfortunate reality is that the meaning of a genus-level taxon concept DOES change when a truly new species is added. However, if we want less precise but more stable taxon identifiers for genera, then we can treat them the same way as species. That is, instead of defining them by the circumscription of all individuals, we can limit the definition to circumscriptions of types (type species for genus concepts, and type specimens for species concepts). Unfortunately, as we discovered in our discussions at Woods Hole, we lose important information about taxa when we fail to distinguish the case of one species-level taxon that is split into two from the case of a brand new species being added (the impetus for the diagram in the photo you included above).

Also, "stability" is actually INCREASED with increased precision, because there is less subjectivity in the definition. The problem isn't a loss of stability, the problem is a proliferation of subtle variants (e.g., Aus Smith sec. Smith vs. Aus Smith sec. Jones). All of these variants are themselves stable; but they confuse matters because we have no good way to reflect the differences in meaning between two precisely-defined genus-level taxon concepts.


mdoering commented on May 29, 2024

Right, the genus concept changes when a new species is added when you look at the included species. But is this really useful for anyone?

It seems to me that what is important for defining the concept is rather delimiting a genus from other genera. Merging and splitting again. For example, the genus Acacia can be referred to as the concept sensu lato, including all species nowadays in Vachellia, or sensu stricto, when you also acknowledge the existence of Vachellia.


deepreef commented on May 29, 2024

Personally, I'm happy with defining a taxon by the set of "types" it contains. That is, a "species" concept represents the sum of the species-group protonyms (as proxies for type specimens) assigned to it as heterotypic synonyms, and a "genus" concept is the set of genus-group protonyms (as proxies for type species) assigned to it as heterotypic synonyms. To me, that solves 80% of the problem with 20% of the effort. However, as we discussed in Woods Hole, this completely misses the ability to discern the "sensu lato/sensu stricto" cases where an existing species is split into two. That is, there is no way to distinguish "Aus bus Smith sec. Smith" (sensu lato) from "Aus bus Smith sec. Jones" (sensu stricto) when Jones splits Aus bus into Aus bus Smith sec. Jones and Aus dus Jones sec. Jones. The same applies to all ranks (genus and above).

Like I said, limiting it to heterotypic synonymy gets 80% of the job done with 20% of the effort. If we want to go beyond that, I think it would be better handled by a system of "RelationshipAssertions" (sensu TCS).


mdoering commented on May 29, 2024

Three implementations dealing with tracking taxon concept changes:


mdoering commented on May 29, 2024

Should the identity stay the same if just the name changes? E.g. one of the synonyms gets accepted, or the name changes its rank, e.g. a species is now considered a subspecies? Type- and concept-wise these are the same, so the identifier should not change, correct?


ThierryBourgoin commented on May 29, 2024

As we discussed already, but too briefly, in Woods Hole, I think that defining a taxon (=concept) by its content is not enough, or may even be useless. A taxon (e.g. genus) has its own definition. Adding or removing a species that fits its definition does not change the taxon definition: it remains the same while its sum has changed! In other terms, trying to define a taxon by the sum of its species is not so useful: different sums could lead to the same taxon and then the same UI! Which is not what we want, I suppose. I might be wrong, but I don’t see this as practicable for the issue of UIs. Additionally (even if it would probably be the best thing to do), I don’t think that we are going to suggest changing the UI each time we add or remove a species from a genus. Conversely, with its own definition a taxon carries a series of implicit characters that link it to a special place in the hierarchy (classification or phylogeny). If you change the place where you hang this taxon, you change all these implicit characters that define the taxon = you change the full/complete definition of the taxon -> you change the taxon. I feel that these are the changes which are really necessary to track, the ones that are important for CoL. Not sure I’m clear here ;-)


mdoering commented on May 29, 2024

@ThierryBourgoin I see your point and it makes a lot of sense. There are various ways to look at what the essence of a taxon is and exactly this is why we need to agree on one definition.

We should probably step back and approach the problem from a user's perspective. What does a user want from a CoL taxon and why does it need an identifier at all?

1. Someone uses the catalogue at some point and wants to have a persistent reference to the exact version they were looking at at that time. That would require a fully versioned CoL with every change triggering a new identifier.
2. People have identified an organism to a CoL taxon, e.g. a specimen or observation. They want access to the current view of the "same taxon" in the CoL that still represents the organism observed, but maybe with a different name, classification or other updated "metadata". This does not require a taxon concept id per se, just a way to get to the (different) identifier for the latest version of the same concept. The concept identifier is basically internal only, but the system still needs to know about concepts. This mostly applies to species and infraspecific taxa, so we probably would not need to worry about higher taxa, but maybe genera.
3. Researchers want to aggregate species-related information from different systems, all linked to CoL taxa. They want to be sure the different systems talk about the same taxon concept and that information can safely be transferred and merged. This seems to require shared concept ids.

From the above I feel we need two identifiers: one for the exact version and one for the taxon concept, to assert that a concept is the same.

The question now is how to know that a concept (as in the set of all theoretically included individuals) is the same. We can either find a way to automatically detect that or rely on experts to tell us. The problem with experts is that they will apply different judgments to what concepts are, so we will see very inconsistent notions of equal concepts across various groups. Something that can be asserted by a computer will be much more useful, as it is predictable and comparable across all groups.
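To make the two-identifier idea concrete, here is a minimal sketch in which both identifiers are computed rather than assigned by experts. The record layout (`TaxonRecord`), the use of included types as a stand-in for the concept, and the hashing are all assumptions for illustration, not an agreed CoL design.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Tuple


def _digest(payload) -> str:
    """Short, deterministic hash of any JSON-serialisable payload."""
    text = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


@dataclass(frozen=True)
class TaxonRecord:
    name: str
    parent: str
    types: Tuple[str, ...]  # included name-bearing types, a stand-in for the concept


def version_id(record: TaxonRecord) -> str:
    """Changes whenever any field of the record changes."""
    return _digest(asdict(record))


def concept_id(record: TaxonRecord) -> str:
    """Changes only when the set of included types changes."""
    return _digest(sorted(set(record.types)))


before = TaxonRecord("Acacia aroma", "Acacia", ("aroma",))
after = TaxonRecord("Vachellia aroma", "Vachellia", ("aroma",))
```

In this toy case the transfer to Vachellia yields a new version identifier, while the concept identifier stays the same, so an old identification could be followed to the current record.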


deepreef commented on May 29, 2024

Thanks, @ThierryBourgoin and @mdoering -- this is helpful. This conversation is touching on the same problems of communication that have plagued these discussions for several decades now (going back at least to the 1980s). Fundamentally, we have different ideas about two issues:

Issue 1 is about which "things" (conceptual entities) we care enough about to label with a persistent identity. Included within this issue is the question of how to explicitly define these "things", so we know when the properties of one thing (represented by its persistent identifier) should be changed (without changing the identifier), vs. when a new "thing" is needed (with its own distinct identifier). At the heart of this issue is which properties of a "thing" define it (i.e., collectively represent its "essence"), and which merely represent relevant metadata associated with that "thing", which may be altered without altering the essence of the "thing".

Issue 2 is about semantics, that is, which terms we use to label each class of "thing". The most problematic terms are "name" and "concept". Both have various synonyms and homonyms in our conversations. What has become clear as a result of MANY conversations almost exactly like this one is that we probably have five or six different classes of "things" that we have, over the years, tried to force-fit into two terms ("name" and "concept").

My fear is that if we do not confront these two issues now, we will make very little progress solving these problems from an informatic perspective. Having dealt with these issues (from an informatics perspective) for many years, these are the "things" that I have found useful for persistently representing conceptual objects in the biological taxonomy realm:

Thing 1: An individual human being, or an entity representing an organization created by human beings. I have used the term "Agent" to refer to this Thing.

Thing 2: A text-string label used to represent an instance of Thing 1 ("Agent"), often parsable into "Surname" and "GivenName" (for people), or a hierarchy of names (for organizations). I have used the term "AgentName" to refer to this Thing.

Thing 3: Documentation instance representing assertions made by one or more instances of Thing 1 ("Agent"), at a particular moment in time. The documentation may be a type of publication, or it may be some other form of static documentation. The word "static" here is critical, because the documentation instance represents a snapshot in time, and thus does not change. For retrieval purposes, it is best to associate each instance of Thing 3 with instances of Thing 2 (AgentName), instead of directly with instances of Thing 1 (Agent). I have used the term "Reference" to refer to this Thing.

Thing 4: A string of text characters, typically represented electronically in the form of UTF-8 encoded text, or printed in the form of glyphs rendered as ink on paper, which serves as a Linnean-style scientific name. These text strings may or may not include components representing taxonomic rank, delimiters (such as parentheses), and authorship information (various styles, formatting and with or without years). I have used the term "NameString" to refer to this Thing.

Thing 5: A specific instance of a Linnean-style taxon name represented as a conceptual entity. This applies to a particular unit of a compound name (not the full combination), which has a particular type (specimen or name) in the context of the Codes, a particular rank (in the sense of Linnean ranks), and a particular authorship associated with the creation of the name. This is different from instances of Thing 4 (NameString) in that it is conceptual, not literal. The essence of an instance of Thing 5 is independent of the text string used to represent it. For example, the same instance of Thing 5 might be represented by different text strings (e.g., different genus combinations for a species, different ranks, different spellings, etc.), and more than one instance of Thing 5 might share the same text string (e.g., homonyms, homographs). I have used the term "Protonym" to refer to this Thing.

Thing 6: A particular treatment or usage of an instance of Thing 5 (Protonym) within the context of an instance of Thing 3 (Reference). Important properties of instances of Thing 6 include the exact spelling of the specific name unit (e.g., the species epithet) as it appears within the instance of Thing 3 (Reference), what taxonomic rank the instance of Thing 5 (Protonym) was asserted as within Thing 3 (Reference), whether or not the instance of Thing 5 (Protonym) was treated as a valid taxon or as a heterotypic synonym of another taxon, and a link to another instance of Thing 6 representing the immediate hierarchical taxonomic parent (e.g., the genus into which a species is placed). I have used the term "TaxonNameUsage" to refer to this Thing, but it could also be referred to as "TaxonTreatment" or just "Treatment" (following how PLAZI uses that term).

Thing 7: The set of biological organisms, including individuals that are dead, alive, and yet-to-be-born, which are explicitly or implicitly included within an asserted Taxon. THIS IS THE THING THAT WE ARE DISCUSSING. Most people I have discussed these issues with over the years have applied the terms "TaxonConcept" and "Circumscription" interchangeably to refer to this Thing. However, as per @ThierryBourgoin's comments above, perhaps we do not have universal agreement that "Concept" and "Circumscription" are synonymous terms. Therefore I propose we use the term "Circumscription" to represent this Thing, to avoid confusion going forward.

Thing 8: This is the Thing that @ThierryBourgoin refers to in his comment above as a "Concept". Basically, its properties include elements of both Thing 7 (Circumscription, or set of included child entities) and Thing 6 (TaxonNameUsage/Treatment), such as the hierarchical classification, treatment as valid or not, and how the name is spelled. Therefore, it is different from Thing 7 (Circumscription) because it is defined by more than just the child items it contains, but it's not the same as an instance of Thing 6 (TaxonNameUsage/Treatment), because there may be many instances of Thing 6 (TaxonNameUsage/Treatment) that all imply the same instance of Thing 8.
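For readers who think in data structures, Things 1 to 7 could be sketched roughly as the record types below. The class names follow the terms used in this comment, but every field name is an assumption for illustration, and the included protonyms in `Circumscription` are only a proxy for the set of organisms. Thing 8 is left out because it has no agreed term or definition yet.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class Agent:                 # Thing 1: a person or organization
    agent_id: str


@dataclass(frozen=True)
class AgentName:             # Thing 2: a text label for an Agent
    agent: Agent
    label: str


@dataclass(frozen=True)
class Reference:             # Thing 3: static documentation, linked via AgentName
    authors: Tuple[AgentName, ...]
    year: int


@dataclass(frozen=True)
class NameString:            # Thing 4: the literal text string
    text: str


@dataclass(frozen=True)
class Protonym:              # Thing 5: the conceptual name unit
    protonym_id: str
    rank: str
    authorship: str


@dataclass(frozen=True)
class TaxonNameUsage:        # Thing 6: a Protonym as treated in a Reference
    protonym: Protonym
    reference: Reference
    spelling: NameString
    rank: str
    is_valid: bool
    parent: Optional["TaxonNameUsage"] = None


@dataclass(frozen=True)
class Circumscription:       # Thing 7, proxied here by included protonyms
    included: FrozenSet[Protonym]


smith = Reference((AgentName(Agent("a1"), "Smith"),), 1960)
jones = Reference((AgentName(Agent("a2"), "Jones"),), 1970)
bus = Protonym("p-bus", "species", "Smith")
usage_smith = TaxonNameUsage(bus, smith, NameString("Xus bus"), "species", True)
usage_jones = TaxonNameUsage(bus, jones, NameString("Aus bus"), "species", True)
```

The two usages differ in reference and spelling but point at the same Protonym, which is the distinction between Thing 5 and Thing 6 described above.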

I apologize for this long post, but there is a reason we've never solved this issue as a community during the past few decades. Unfortunately, most of that reason has to do with miscommunication, and most of the miscommunication has to do with a mixture of how we define our core objects (Issue 1) and what terms we use to represent them (Issue 2; i.e., semantics).

I believe that we already have well-tested, non-contentious definitions for Things 1, 2, 3, and 4. After the dinner conversation in Woods Hole, I am confident we can fairly quickly settle on a clear definition for Thing 5. If we can achieve that, then the definition of Thing 6 is extremely easy. Therefore, the real issue for us to deal with is whether Thing 7 and Thing 8 need to be different Things, or if we can adequately accommodate them with a single Thing. Originally I thought we could get by with a single Thing, but after the comments by @ThierryBourgoin and @mdoering above, it seems we should seriously consider defining them as separate Things, each with its own identifiers.

In either case, I think it's important that we understand the difference between defining what Things we need to manage in CoL-Plus, and deciding which terms to use to refer to those defined Things. I think it would be a grave mistake to start defining data models and such until after we come to consensus on the Things we're managing and the terms we're using to refer to those Things.

Phew... and this is just the BEGINNING of the discussion!


deepreef commented on May 29, 2024

One more point.... in response to the comment by @mdoering above, "versioning" of CoL representations can be handled in several ways:

1. Internally using version histories for the same identifiers plus a date-stamp;
2. Generating new identifiers to represent each version;
3. Capturing each new version via a new instance of Thing 6 (with a Reference representing CoL as the Author and the date of the change as the date, and the properties of spelling, validity, classification, etc.)

There are other ways as well, but #3 above represents the simplest in terms of coding and implementation.


mdoering commented on May 29, 2024

Linking the drawing from the Woods Hole CoL meeting April 2017 illustrating changing concepts (numbers) over time with types indicated by colored dots:

Original single species A.bus gets split into A.bus and A.fus. A.bus s.str is then merged with A.xus.
Knowing the types alone is not in all cases enough, otherwise A.bus s.l. (1) would be the same as A.bus s.str. (2). But when you know about all the species within the genus and know A.bus is also a pro parte synonym of A.fus you can derive the unique concepts
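The point about types alone being insufficient can be sketched with a toy encoding. The `concept_key` function and the way pro parte synonymy is recorded are assumptions for illustration, with concept numbers taken from the drawing as described in this thread.

```python
from typing import FrozenSet, Iterable, Tuple

ConceptKey = Tuple[FrozenSet[str], FrozenSet[str]]


def concept_key(
    types: Iterable[str],
    pro_parte_synonym_of: Iterable[str] = (),
) -> ConceptKey:
    """Included types plus the taxa this one is a pro parte synonym of."""
    return (frozenset(types), frozenset(pro_parte_synonym_of))


# Numbers follow the drawing; the encoding itself is an assumption.
concepts = {
    1: concept_key({"bus"}),                                       # A.bus s.l.
    2: concept_key({"bus"}, pro_parte_synonym_of={"fus"}),         # A.bus s.str.
    3: concept_key({"fus"}),                                       # A.fus
    4: concept_key({"xus"}),                                       # A.xus
    5: concept_key({"bus", "xus"}, pro_parte_synonym_of={"fus"}),  # A.bus s.str. + A.xus
}

# Dropping the pro parte information collapses concepts 1 and 2.
types_only = {number: key[0] for number, key in concepts.items()}
```

With types only, concepts 1 and 2 are indistinguishable, while the combined key keeps all five apart.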


ThierryBourgoin commented on May 29, 2024

I think we need to be precise here about the words we use… (concept, step of concept

If I read the figure correctly:
We have only here 3 different taxonomic concepts: A. xus, A. bus and A. fus.
1960: taxon A. bus s.l. is described (1)
1970: taxon A. xus (4) and taxon A. fus (3) are described. Some specimens of A. bus s.l. belongs to A. fus.
We have 2 new concepts (3) and (4) + 1 old concept (1) more restricted BUT still the same concept.
1980: A. xus is synonymized with A. bus s.s.; A. fus remains. We have 2 concepts: (1), in still another step, and (3).

1, 2 and 5 are different stages/steps of the same taxonomic concept.

Type-bus (red dot) is the same in all stages/steps of the life of the same taxon A. bus (s.l., s.s., and including A. xus).
So yes a type does represent all the stages of the life of a taxon, but this is not what it is supposed to do: it is just bearing the name for this taxon.
The type has nothing to do with the concept understanding, it is just the bearing-name specimen for this concept.
This specimen is only one in the many others that “make" the taxon, it provides the link between nomenclature and taxonomy.

In this example the taxonomic concept for A. bus remains the same; it just evolves in time according to its content (=extension), which is more or less restrictive (different steps/numbers of the same concept): successive stages/steps 1, 2 and 5.
=> a same concept may have different successive names according to its extension.

However concepts are defined by 1) their content (extension = set of children-taxa/specimens to which the concept applies) AND 2) also by intension (list of its characters = its description) - and not by the type specimen.
If a taxon is transferred to another parent taxon with its set of children-taxa (a genus from one tribe to another tribe, a species from one genus to another genus), it changes by intension (its characters/description are changed). Accordingly, in that case this is no longer the same concept; we have 2 concepts: an old one and a new, different one, although it keeps the same name! (Except for the brackets in the case of species transferred to another genus.)
=> a same name (particularly in supraspecific taxa) may refer to different concepts.

This is why 1) defining taxa by their extension only remains insufficient (my issue in the Woods Hole meeting) and 2) speaking of a taxon without referring to its classification (e.g. sec. author) might introduce strong bias, if not errors, in any taxonomic database if we don’t take care of these very particular inferred links (my point/talk in the Xishuangbanna meeting).


mdoering commented on May 29, 2024

@ThierryBourgoin so you say all 163 Acacia species that have been moved to the genus Vachellia should be considered different taxa describing a different set of organisms? Identifications to Acacia aroma cannot be safely transferred to Vachellia aroma as their circumscription is different?


mdoering commented on May 29, 2024

@ThierryBourgoin can you explain what you have in mind when the concept is more restricted but still the same concept? That sentence to me contradicts itself. If some specimens/organisms are excluded it is clearly different.


ThierryBourgoin commented on May 29, 2024

Let me try an example of what I have in mind:

The taxonomic concept of the giraffe (G. camelopardalis) has recently been disputed (and still is, as far as I know) and the species concept has been ‘restricted' to the “Northern giraffe”, while 3 other species were recognized (the reticulated, Southern and Masai giraffes)…
I regard the initial taxonomic concept of what G. camelopardalis (s.l.) is as being still the same, but it has been restricted (s.s.) to the north African populations.

Let us say that new analyses conclude in the future that this is not the case for 2 of them, the Southern and the Masai taxa. These 2 separated species will then come back ‘inside’ the taxonomic concept of G. camelopardalis, which will be more widely understood than now but still less widely than originally.
These are just successive steps in the circumscription of the same concept, viewed by extension.

Now let us say that a new analysis by author NNN shows that the giraffe is not a ruminant (Ruminantiamorpha) and should be moved from Giraffidae to the whales in Balaenidae ;-)
Then Giraffa would be characterized by its own characters of course (the ones that allow to recognize the set of all its included subtaxa) but also by all the characters of Balaenidae and not the ones of Giraffidae. For me this new definition by intension (new list of characteristics of Giraffa, including those of Balaenidae) would make the taxonomic concept a totally different one for Giraffa sec. NNN.

I don't know if I can put it this way, but in other words I would say that changing the content of a taxon does not change it (as a taxonomic concept), but changing the characteristics that it shares with other taxa (what we do with taxonomic transfers) does. In your example, Acacia and Vachellia each remain the same concept, with a more restrictive or a wider understanding of their respective taxonomic concepts, but Vachellia aroma and Acacia aroma are two different taxonomic concepts.


mdoering commented on May 29, 2024

Thanks @ThierryBourgoin. For identification purposes it is important that we capture the different opinions over time. In the terminology I propose here, this means the concept of which populations are in and which are out does change, even though the type remains. In your example of a hypothetical merge of the Southern and Masai species back into G. camelopardalis we would actually have 3 different concepts over time, all known under G. camelopardalis. Referring to all 3 of them as the same concept would not allow us to deal with identifications accurately.

Take a look at iNaturalist to see why that is important for handling (historical) identifications:
https://www.inaturalist.org/pages/curator+guide#changes
Actual changes they track (unfortunately both Acacia and Giraffe are outdated): https://www.inaturalist.org/taxon_changes

A good bird example for a split based on distribution ranges:
https://www.inaturalist.org/taxon_changes/32924


mdoering commented on May 29, 2024

I do understand your point about intension. The classification should be significant for the characters that define the taxon. But in many cases these do not alter the unit of populations that make up the taxon. The important part is that as long as the populations which make up the taxon do not change, the taxonomic concept has not changed, even if the circumscription might now include more or fewer characters. The primary anchor point is the group of populations that form a stable unit, not how exactly we characterize them. From Wikipedia:

In biology, a taxon is a group of one or more populations of an organism or organisms seen by taxonomists to form a unit. It is not uncommon, however, for taxonomists to remain at odds over what belongs to a taxon and the criteria used for inclusion.


mdoering commented on May 29, 2024

As in practice it is difficult to assess whether a change in characters has an actual effect on the size of the included populations it probably makes sense in some cases to track these concepts in their minute details. But this leads us again to an explosion of concepts. Every identification key will define its own concept, every change in classification yet many more. For the purpose of dealing with identifications, whether observations in GBIF or specimens in collections, we would like to have more stable identifiers though than names, not less stable ones.


ThierryBourgoin commented on May 29, 2024

Hi Markus,

I think we agree.
The same taxon might undergo different changes in its concept (how it is understood), and tracking these changes is crucial.
But all these changes are not equal.

Any new specimen added to a taxon changes its concept: it extends its sample, possibly widens its distribution, adds new associated biological data… and so restricts, extends or refines the taxonomic concept that supports the taxon. In fact, almost each time we handle a taxonomic concept (in a publication) we change it (= chresonymy, potential taxon). In practice we are not going to register/identify all these stages, but it would be ideal.
Because of nomenclature issues, you mainly refer to cases when type specimens are involved/concerned. I agree with you: we need to identify separately these concept changes. They track the evolving definition of the concept by extension.
But transfer of taxa within the classification are also important (if not even more) from the pure taxonomy point of view (not nomenclatural) and we don’t track these ones at the moment. This is my point, which I think important for a taxonomic database issue. And these track definition of the concept by intension.
Extension and intension are equally important, both participate to the concept definition for a full representation of a taxon in a database I think.

;-), Th.


mdoering commented on May 29, 2024

yes. All 3 are probably best dealt with as different identifiers if you need all of them. I am just not sure if we do have users that need all of them. For number two I am sure we have.


dremsen commented on May 29, 2024

I'm very happy to see this thread back in action and wish to contribute constructively. I need to spend a bit more time reviewing all of this to get back in this frame of mind but I have two immediate comments.

I do not believe that the addition of a new specimen to a taxon changes the concept. The concept is not the specimen. The link between the identifier and the specimen is only through the concept. This is very clear within the famous Triangle of Reference model. In taxonomy, concepts are ideas expressed as publications (sometimes poorly) and anchored with the type. Specimens conspecific with the type are instances of the concept, not new concepts. This is why heterotypy must be the means by which concepts are expressed. The giraffe example is almost identical to the graphic example from Woods Hole (which shows five distinct concepts).

I remain unsettled regarding the higher classification being a property of the concept. Paul Kirk and Jerry Cooper were very resolute on this matter in regard to homotypic synonymy where a taxon was transferred to a different genus. No circumscription change and hence no concept change. A genus transfer is just a smaller iteration than a transfer to a higher group.

If a giraffe is transferred from the ruminants to the whales, then I can see this being a major change in what the whale group is but has the giraffe changed? I can see where a single concept might be sorted into different categories by different parties without the concept itself having to be changed.

For example, when David Patterson inserts the Choanoflagellata as a parent for all metazoa in his Union classification, does he really create all new concepts for all the fulgorids?

from general.

ThierryBourgoin commented on May 29, 2024

Hi Dave. Yes I'm also happy to see all this back again... ;-)

Finding a specimen of Giraffa in South Africa would surely modify your concept of Giraffa as being more widely distributed. It will not change the name (the 'Signifier' of the concept) but it does change its content (the 'Signified'). Distribution is one of the attributes of the concept, part of how we understand the taxon.
If a taxon is transferred to another parent taxon, its definition is changed even if its circumscription (content) is not. A concept is never defined in one way only; it is defined by extension and by intension at the same time. If Giraffa moves to whales, it acquires all the characteristics proper to whales up to the first common ancestor of Giraffidae and whales (Mysticeti, Cetacea, Whippomorpha). The Giraffa concept is completely changed (by intension), having acquired all the successive synapomorphies of these clades. The whale concept is also changed: by intension, in incorporating some Giraffa autapomorphies, and by extension, in incorporating Giraffa.
When Choanoflagellata is inserted as the parent of Metazoa, the characteristics of Choanoflagellata become part of the metazoan lineage and its child taxa, and therefore, yes, of the fulgorids. Leaving Choanoflagellata as sister to Metazoa excludes these characteristics from the metazoan lineage. Of course in practice we don't document these changes, but formally and logically they occur, and I suppose documenting them is necessary for an accurate representation of taxonomy and its management in biodiversity bioinformatics.

from general.

ThierryBourgoin commented on May 29, 2024

In fact my point here is that I would like to be sure that we don't have to redo this exercise later because the schema we are using to represent taxonomic knowledge is not complete enough. It was not necessary 20 years ago to separate names from taxa...

;-) Th.

from general.

dremsen commented on May 29, 2024

Thierry, I certainly agree with this last sentiment and so wish to be very careful. We need an identifier system that is tractable and has practical value while at the same time being precise enough to have meaning. My perspective is mainly as a user with a particular set of use cases and as a developer examining and trying to model concepts as presented in monographs and faunas.

from general.

mdoering commented on May 29, 2024

If there is no use case I don't think we should implement it. Keep things simple. It is not bad to refactor things in a few years, but to create something which is not used in the first place is wrong.

The ever-changing identifiers in the CoL have been a huge problem for its uptake; we need something far more stable. And in my opinion (based on use cases from GBIF, Collections, iNaturalist and others) we need something to hold on to a stable taxon regardless of its name. Such a taxonID paired with a nameID is very powerful and would be a serious game changer.

from general.

dremsen commented on May 29, 2024

I saw the update came in and wanted to check in. Where do we stand on taxon concept IDs? I've been giving them a lot of thought recently. I think there are use cases for them. I think they are tractable. I think we can accommodate Thierry's interest in supporting the classification as a component of them. But, referring to a 180918 comment of Thierry's, a separation of names from taxa, or more specifically, syntax from semantics, is a requirement.

from general.

mdoering commented on May 29, 2024

I talked with Nico Franz about this in Leiden and he is considering looking for funds.
I basically still believe the original idea we had in Woods Hole makes a lot of sense and I would like to implement that next year as an experimental feature. It came up in Leiden as a requirement for many people and projects, e.g. legal documents of the EEA.

The issue about including the classification in it needs discussion, but I am convinced that we should go for two different concept ids in that case. One that includes the classification and one that purely looks at the included set of organisms.
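
A minimal sketch of the two-identifier idea, assuming a taxon is represented by the set of protonym identifiers it includes plus an ordered list of parent names. The hashing scheme, the protonym keys, and the function names `circumscription_id` and `classified_id` are illustrative assumptions, not anything CoL has specified.

```python
import hashlib

def circumscription_id(protonyms):
    """Hash only the included protonyms; order and duplicates do not matter."""
    canonical = "|".join(sorted(set(protonyms)))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()[:12]

def classified_id(protonyms, lineage):
    """Hash the circumscription together with the ordered lineage of parents."""
    canonical = circumscription_id(protonyms) + "/" + ">".join(lineage)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()[:12]

# Hypothetical data: the same genus content placed in two different families.
aus = ["aus-bus-1900", "aus-cus-1905", "aus-dus-1910"]
smith = (aus, ["Animalia", "Aiidae"])
jones = (list(reversed(aus)), ["Animalia", "Xiidae"])

same_content = circumscription_id(smith[0]) == circumscription_id(jones[0])
same_placement = classified_id(*smith) == classified_id(*jones)
print(same_content, same_placement)
```

With two identifiers, a user who only cares about the included set of organisms can ignore reclassification, while a user who cares about placement can compare the second id.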

from general.

dremsen commented on May 29, 2024

Agree with you on both counts. I have some rebuttals to the idea that this is really a concept-by-intension component of the concept, but if they are separate IDs, where the classification is distinct from the circumscription and the circumscription is based on the sets of included protonyms, then I'm right with you. I see this as a requirement for many other types of users too, especially in eco and conservation uses.

from general.

mjy commented on May 29, 2024

> I talked with Nico Franz about this in Leiden and he is considering looking for funds.

If Nico can make his engine available I think many questions are resolved.

from general.

dremsen commented on May 29, 2024

I've had good conversations and relations with Nico and worked hard to verify we are in congruence in our views on concepts. I think he's great.

from general.

ThierryBourgoin commented on May 29, 2024

Thanks Dave. Unfortunately our paper was rejected last week: one reviewer said in 5 lines (...) that NCBI has already done that and that our proposal is not practical! My only aim with this paper was to warn that separating names from taxa might not be enough to fully capture taxonomic knowledge in the future... probably this was not clear enough 😕. Anyway, we are working on a new version with Nicolas, René and Regine (in copy), and we'll try to tackle the issue from the iUID perspective by providing some rules for when we should consider that there is a new taxonomic concept because the taxon definition changes by extension or intension. In fact, we are not even sure that the taxonomic triplet (name, taxon, classification) is enough for a complete and accurate formalisation of a taxon for digital purposes... we are also working on this. BW. Th.

from general.

dremsen commented on May 29, 2024

Nico's system relies on articulations that assert concept relations but requires a source for them. In his demonstrations these usually are supplied from an external source. The protonym model provides the means to supply computable articulations to that system.

from general.

mdoering commented on May 29, 2024

Yes, one should be able to retrieve most articulations from the basionym/protonym relations as the proxy for a shared type. BUT these are not necessarily equals relations if you think about the image above and splits and merges. Also I would think it is relevant to know WHEN the taxonomy was last updated, as one can then discard names that have been described since. E.g. if it's known that a split occurred later, then the use of that name might be precisely linked to the former sensu lato concept. I am not convinced we can encode all our knowledge into RCC5 relations easily. Or if so, this at least is the difficult part.

from general.

dremsen commented on May 29, 2024

Date is indeed important. But this should be a component of the source properties I should think.

from general.

rdmpage commented on May 29, 2024

Feels like I'm gate crashing this discussion, but blame @mdoering for bringing this thread to my attention ;) Frustrated by my attempts to make sense of taxa and taxonomic names in Wikidata (which is fast becoming the broker for taxonomic identifiers, and indeed any sort of identifier) I have been revisiting taxonomic concepts, etc. When will I learn?

Reading this thread I'm overwhelmed by a sense of "here we go again"; however, I want to suggest an approach that I think would be both doable and create real value for the wider community. @deepreef #6 (comment) teased out eight(!) things that are being discussed, which I basically agree with, except I would junk 7 and 8. That is, I don't think defining taxa intensionally (#8) makes much sense (this is something you compute based on a tree after the fact; it conflates defining something with learning about it), and I don't actually think circumscription is something a taxonomic database is best for (#7), in the sense that the bulk of "circumscription" is happening elsewhere (e.g., iNaturalist users saying this is a photo of "x", DNA barcoders saying this sequence belongs in BIN "y", etc.). Even if a taxonomic database had circumscriptions, why would iNaturalist or BOLD or even GBIF use those rather than the circumscriptions they generate themselves? We can get higher taxon circumscriptions easily enough from a classification, but the notion that a changed set of species makes the genus a different taxon seems unhelpful. And don't get me started on the bizarre approach the Atlas of Living Australia takes to changing taxon identifiers almost daily.

So, this leaves #5 and #6, namely "protonyms" and "usages" (I'm taking #1 - #4 as essentially given, maybe subject to tweaks).

So, as I sketch out in "Taxonomic concepts: a possible way forward", it seems to me that a really useful tool would be something like this:

First, every protonym gets a nice, human-readable identifier, for example a combination of species epithet, author, and year. Whatever it takes to be human readable and unique (the blog post talks about previous efforts at "uninomial" nomenclature, which is the inspiration). Linked to this identifier is every homotypic synonym of that name. This would enable a user, for example, to have a stable identifier for a species that didn't change when the species was moved to another genus. This is essentially #5 (I think). One immediate advantage is that the sort of classification comparison that, say, eBird does, becomes available to all, because there are stable identifiers for species names (and all their variations). It would make Wikidata's life easier as it would need only one of these identifiers for each species (regardless of what particular genus and species pair it treats as accepted).

Then imagine that same identifier is linked to every "usage" (name + reference pair) that we consider to be relevant, including heterotypic synonyms. This would enable a user to generate things like the current name and all synonyms, as well as go back and generate a snapshot of what the taxonomy was in, say, 1990. I think this is basically an aggregation of #6, and is close to the notion of a taxon concept being an "according to" statement.

One could imagine an interface (both web and API a bit like):

/n/aus-fred-1909 gives you all homotypic names that share this protonym
/t/aus-fred-1909 gives you all usages, ending with the current viewpoint of the database (in the context of the whiteboard diagram shown above, this is effectively tracing the fate of the type). This would include heterotypic synonyms, in other words, any taxon where this name is relevant.
/t/aus-fred-1909#time gives you the state of play at a given moment in time
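
The three routes above could be backed by something as small as the following sketch. The `Usage` record, the sample data, and the function names are hypothetical and only illustrate the shape of the lookups; they are not an existing API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Usage:
    name: str        # name string as used in the reference
    reference: str   # citation or identifier of the publication
    year: int        # publication year of the reference
    homotypic: bool  # True if the name shares the type of the protonym

# Hypothetical records keyed by a human-readable protonym identifier.
USAGES = {
    "aus-fred-1909": [
        Usage("Xus aus", "Fred 1909", 1909, True),
        Usage("Yus aus", "Smith 1950", 1950, True),
        Usage("Yus bus", "Jones 1995", 1995, False),  # heterotypic synonym
    ]
}

def names(protonym_id):
    """/n/<id>: all homotypic names sharing this protonym."""
    return sorted({u.name for u in USAGES[protonym_id] if u.homotypic})

def usages(protonym_id):
    """/t/<id>: every usage in chronological order, ending with the latest."""
    return sorted(USAGES[protonym_id], key=lambda u: u.year)

def usage_at(protonym_id, year):
    """/t/<id>#<year>: the most recent usage published on or before `year`."""
    known = [u for u in usages(protonym_id) if u.year <= year]
    return known[-1] if known else None

print(names("aus-fred-1909"))
print(usage_at("aus-fred-1909", 1990).name)
```

Asking for the state of play in 1990 returns the 1950 usage, because the 1995 synonymy had not yet been published.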

Everything else (actual "content" of each taxon, implications for characters of taxa, etc.) are all things one could compute from the classification if you wanted, but I think these are really separate things. And I struggle to see the demand for them globally, as opposed to what may well be intense interest in specific cases.

But I think there is a global need for a stable way to refer to a "taxon", and I think this might be a way forward. It's one step beyond names in that is expressly linked to information about the name and its use, but it's relaxed enough for someone to be able to just link to an identifier without having to determine if the "concept" exactly aligns. It avoids what feels like a black hole of defining taxa by extension or intension.

If, for example, the identifiers were DOIs, clean and human readable, I imagine this could be enormously useful, and solve genuine and tractable problems.

from general.

dremsen commented on May 29, 2024

Happy to read Rod's post. The protonym model is the way to model concepts. I've argued for this for too many years. It was the basis for the uBio data model. It separates syntax from semantics, providing an objective basis for defining computable taxon concepts. This separation is critical and remains a fundamental problem for the long term viability of the CoL because one cannot mint taxon identifiers without it. A list of the world's species without the means to properly provide species identifiers is a problem. In uBio we had NameBank, which grouped strings into lexicons into names into protonym groups. In ClassificationBank taxa were (implicitly) groups of protonyms. Different treatments of the same name could be compared by their protonym array. This is how taxa are represented within annotated catalogs and treatments. Circumscriptions via specimens or literature are always tied to a name that is tied to a protonym or a treatment (asserting a taxon inclusive of a set of protonyms). The structure was there, I just didn't have all the data properly mapped. Until we have a system that cleanly separates names from concepts (i.e., syntax from semantics) we don't have the right system. When we do, we can properly catalog objective synonyms independently from subjective synonymic assertions, we can acquire a useful objective dataset that we don't have to toss every time new evidence changes a taxon or pulls a GSD, etc., and we can enable an inclusive and applied taxonomic infrastructure that doesn't artificially cover up the natural flux and ferment that is taxonomy. We can also support the more granular and refined taxonomic use cases required by the Nico Franzes of the world. The only question for the CoL should be whether we are in this space once and for all or just waiting for someone else to do it. Until then, the job isn't done.

from general.

mdoering commented on May 29, 2024

Thanks @rdmpage. Your design following just the protonym/type was what I had initially hoped would solve it for us too. But I think this is flawed for really important use cases. We want stable taxon ids to track splits and merges so that an occurrence of species A. bus sensu 1960 on the whiteboard is not confused with A. bus sensu 1970. These are two different taxa with the 1970 one being a subset of the 1960 one. There are 5 concepts in those 6 usages in the diagram which I would really like to attach 5 different ids to. That's what Avibase does.

from general.

dremsen commented on May 29, 2024

Is there a place where that is modeled (or laid out) so we can look at those cases? To which diagram are you referring?

from general.

mdoering commented on May 29, 2024

@dremsen The picture of the whiteboard at the top of the github discussion you are on, Dave

from general.

dremsen commented on May 29, 2024

Sorry, was using email, not GitHub. Moving now.

from general.

dremsen commented on May 29, 2024

Would Avibase model that with 5 concept IDs or 6 concept IDs?

from general.

dremsen commented on May 29, 2024

Is there a previous discussion on modeling splits?

from general.

rdmpage commented on May 29, 2024

> Thanks @rdmpage. Your design following just the protonym/type was what I had initially hoped would solve it for us too. But I think this is flawed for really important use cases. We want stable taxon ids to track splits and merges so that an occurrence of species A. bus sensu 1960 on the whiteboard is not confused with A. bus sensu 1970. These are two different taxa with the 1970 one being a subset of the 1960 one. There are 5 concepts in those 6 usages in the diagram which I would really like to attach 5 different ids to. That's what Avibase does.

@mdoering There's no reason why this can't be included. In the same way you could, say, append a time stamp to the t/id, you could imagine doing the same for a specific usage, so you would have an id for a specific usage if you wanted. As an analogy, imagine a web page showing the history of a name where each usage (name + reference) has a fragment identifier, e.g. #1970. The idea of suffix identifiers comes from ARKs, which I don't particularly like as an identifier but they do support suffixes (one could also mint DOIs with suffixes). Whatever the implementation, I think you can have what you seek. We could regard identifiers as hierarchical. By default you get the original name /n/, if the system has a list of usages then /t/ gives you that, and /t/xxx#1970 gets you a specific usage. I guess I envisage some sort of graceful degradation where you always get something.
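
One way to picture the graceful degradation is a resolver that returns the most specific record it holds and falls back otherwise. This is only a sketch: the identifier syntax follows the examples in this comment, while the stores `ORIGINAL` and `USAGES` and their contents are made up for illustration.

```python
def parse(identifier):
    """Split '/t/aus-fred-1909#1970' into (kind, protonym_id, suffix)."""
    path, _, suffix = identifier.partition("#")
    parts = path.strip("/").split("/")
    if len(parts) != 2 or parts[0] not in ("n", "t"):
        raise ValueError(f"unrecognised identifier: {identifier!r}")
    return parts[0], parts[1], suffix or None

# Hypothetical stores: the original name per protonym, and usages by suffix.
ORIGINAL = {
    "aus-fred-1909": "Xus aus Fred, 1909",
    "bus-jones-1920": "Xus bus Jones, 1920",
}
USAGES = {
    "aus-fred-1909": {
        "1909": "Xus aus sensu Fred 1909",
        "1970": "Yus aus sensu Smith 1970",
    }
}

def resolve(identifier):
    """Return the most specific record available, falling back step by step."""
    kind, pid, suffix = parse(identifier)
    known = USAGES.get(pid, {})
    if kind == "t" and suffix in known:
        return known[suffix]                      # a specific usage
    if kind == "t" and known:
        return [known[k] for k in sorted(known)]  # the full list of usages
    return ORIGINAL[pid]                          # at least the original name

print(resolve("/t/aus-fred-1909#1970"))
print(resolve("/t/bus-jones-1920#1970"))
```

A request for a usage the system does not hold still returns something: the list of usages if there is one, otherwise the original name.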

from general.

mdoering commented on May 29, 2024

The seagull Larus argentatus got split into Larus argentatus and Larus armenicus.
There are 3 ids in iNaturalist for them, including one for Larus argentatus s.s. and one for Larus argentatus s.l.:

https://www.inaturalist.org/taxon_changes?taxon_id=204533

Both Larus argentatus taxa share the same name thus surely also the same protonym.
Avibase might even have more concepts, but I don't immediately understand that webpage:
https://avibase.bsc-eoc.org/species.jsp?lang=EN&avibaseid=F002188E226DF09C

from general.

mdoering commented on May 29, 2024

@rdmpage I was thinking along similar lines. As in the Plazi timeline, you nail down the concept by the timestamp. But concepts also exist in parallel and do not follow a sequential timeline.

Also the main problem here for me is: how do I know, when looking at the data, that the concept has changed? That is, ignoring identifiers completely, as we want to (re)assign them in the process. What warrants a taxon change and when does it remain the same? We need to find objective rules for what the algorithm has to do. That also gives a real meaning to CoL ids, as they are based on objective rules.

from general.

mjy commented on May 29, 2024

For the record we use the label "Protonym" to refer to a nomenclatural concept, and OTU to refer to a biological concept. Taxa/OTU (biological things) are not Protonyms in the example below. Talking about biological things being Protonyms seems inherently confusing to me.

Given that, do this:

Link your biological data to OTUs (anonymous entities linked to nomenclature)
Manage your nomenclatural concepts in separate graph (facts)
If you want to model logical assertions between biological concepts then include Nico Franz's graph of object properties/relationship types between OTUs (a different graph, keep the heck away from the nomenclatural graph)
Stack citations, however many you want on any concept (e.g. OTU, Protonym, Franz graph relationship, relationship between OTUs, relationship between Protonyms, etc.). This is your timestamp proxy.
Stack identifiers, however many you want, of whatever type you want on any of the concepts (see above). You, as a curator are making the calls as to whether your concepts align.

For what it's worth we have 100s of thousands of taxon names, OTUs, specimens, citations, and identifiers following this approach in TaxonWorks, i.e. it's not an imagined approach.

from general.

dremsen commented on May 29, 2024

My recollection of the AviBase model (which could be wrong) was that everything got a distinct taxon id (even if their 'computable' circumscriptions were identical). Subsequent articulations would establish they were congruent.

from general.

rdmpage commented on May 29, 2024

> @rdmpage I was thinking along similar lines. As in the Plazi timeline, you nail down the concept by the timestamp. But concepts also exist in parallel and do not follow a sequential timeline.
>
> Also the main problem here for me is: how do I know, when looking at the data, that the concept has changed? That is, ignoring identifiers completely, as we want to (re)assign them in the process. What warrants a taxon change and when does it remain the same? We need to find objective rules for what the algorithm has to do. That also gives a real meaning to CoL ids, as they are based on objective rules.

@mdoering I think "when does it change and when is it the same?" leads to madness. And it's separate to the identifier issue, in that at one level every taxon that includes a given protonym would have the same identifier. Every usage gets an identifier, so the question to ask is not "is this the same as that?" it's what usage (if any) do I want to refer to? I don't know of any particularly useful way to say whether a taxon is the same or not (that doesn't quickly lead to absurdity) but you can ask whether the taxa share types. I guess I'm arguing that any approach that asks either "what is a taxon" or "when are two taxa the same" is digging a hole for itself.

from general.

mdoering commented on May 29, 2024

Usages are much more friendly to work with if you think of publications. Then they are immutable and you have a limited number of them per name. But for dynamic databases it becomes a different problem. The system has to decide when to create new identifiers. In this sense every monthly release of the CoL would be a different usage and has its own set of taxon ids. I doubt this is very useful to anyone. It is much like ALA and what CoL used to do. Identifiers change all the time. Names are actually more stable.

But what we want are ids that are more stable than the name ids: to keep the taxon identifier the same if the concept is still the same, regardless of its accepted name. But that requires either a human to make an assertion or a machine to compare taxa for equality. There is no way we can get human assertions for millions of taxa every month. And they would also be very subjective, and the rules applied to judge would differ a lot.

My goal is to start with the type specimen as the anchor for a taxon, but refine that for splits and merges by comparing adjacent taxa and their types. If you have a globally complete taxonomy and compare several versions of it (1960/70/80 in the example above), a missing protonym for A. fus tells you the A. fus you are dealing with is from 1960. And the presence of A. bus as a pro parte synonym in 1970 for 'A. fus' tells us it's a split. So we know (1) is the union of concepts (2) and (3).
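
A rough sketch of that comparison, under strong simplifying assumptions: each release maps a taxon label to the protonyms it is anchored on plus any protonyms cited as pro parte synonyms, and a split is flagged when a taxon anchored on a protonym unknown to the older release cites an older taxon's protonym pro parte. The data layout and the name `detect_splits` are hypothetical.

```python
def detect_splits(old, new):
    """Return {old_taxon: sorted new taxa} for old taxa that were split.

    Each release maps a taxon label to a dict with 'protonyms' (the types it
    is anchored on) and 'pro_parte' (protonyms cited as pro parte synonyms).
    """
    old_protonyms = set().union(*(t["protonyms"] for t in old.values()))
    owner = {p: label for label, t in old.items() for p in t["protonyms"]}
    splits = {}
    for label, taxon in new.items():
        if taxon["protonyms"] & old_protonyms:
            continue  # anchored on a type the old release already knew
        for p in taxon.get("pro_parte", ()):
            if p in owner:
                parts = splits.setdefault(owner[p], set())
                parts.add(label)
                # the new taxon that keeps protonym p is the narrower remainder
                parts.update(
                    other for other, t in new.items() if p in t["protonyms"]
                )
    return {label: sorted(parts) for label, parts in splits.items()}

# Hypothetical releases loosely modelled on the whiteboard example.
release_1960 = {"A. bus": {"protonyms": {"bus"}, "pro_parte": set()}}
release_1970 = {
    "A. bus": {"protonyms": {"bus"}, "pro_parte": set()},
    "A. fus": {"protonyms": {"fus"}, "pro_parte": {"bus"}},
}

print(detect_splits(release_1960, release_1970))
```

Here the 1960 A. bus is reported as the union of the 1970 A. bus and A. fus, which is the signal that the two A. bus usages should not share a taxon id.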

The goal is to create stable taxon ids as anchor points to link identifications to. The current name can then happily change and if a split or merge happens the id will change and the identification is still referring to the old broader or narrower concept.

from general.

mdoering commented on May 29, 2024

Well, the base for the taxon is the set of types. In the diagram this already identifies 4 out of 5 concepts. The A. bus s.l. vs s.s. is not covered and they would both fall into the same id. But maybe that is still a good start for CoL. It's simple, straightforward to implement and is definitely a step forward from name-based identifiers, which most systems have, including the current CoL and GBIF. And it's easy to communicate and, most importantly, it can be derived from the data.

from general.

dremsen commented on May 29, 2024

Markus, thanks for those clear statements. This is the direction I also favor.

from general.

dremsen commented on May 29, 2024

same

from general.

mjy commented on May 29, 2024

Can anyone answer whether a taxon concept sensu the title of this issue has biological attributes?

I.e. do you imagine that the following triples are part of the system:

my_taxon_concept_uri has_some height
my_taxon_concept_uri has_color purple
my_taxon_concept_uri eats snails

from general.

mdoering commented on May 29, 2024

No they clearly won't. No traits and description based circumscriptions are planned to be in CoL.
And when I write about types we can manage type specimens, but I doubt we ever list them for all species. So using the protonym as a type proxy is what will be done.

from general.

rdmpage commented on May 29, 2024

@mjy

> Can anyone answer whether a taxon concept sensu the title of this issue has biological attributes?
>
> I.e. do you imagine that the following triples are part of the system:
>
> my_taxon_concept_uri has_some height
> my_taxon_concept_uri has_color purple
> my_taxon_concept_uri eats snails

I can't answer for the thread, but I only got into this now because this is the issue that arises in Wikidata. People are adding attributes like these to Wikidata "taxa" when it seems clear that many such "taxa" are names not taxa (in the sense that homotypic synonyms may have their own Wikidata items, so clearly "taxa" aren't always "taxa").

So I guess where you are going with this is what do we hang attributes on? I was imagining that, again, we could have a hierarchy of identifiers. You could just hang it on the name, or the protonym (in the sense of the complete set of usages of the name), or on a specific usage (equivalent to "sensu Fred").

from general.

rdmpage commented on May 29, 2024

@mdoering

> Well, the base for the taxon is the set of types. In the diagram this already identifies 4 out of 5 concepts. The A. bus s.l. vs s.s. is not covered and they would both fall into the same id. But maybe that is still a good start for CoL. It's simple, straightforward to implement and is definitely a step forward from name-based identifiers, which most systems have, including the current CoL and GBIF. And it's easy to communicate and, most importantly, it can be derived from the data.

I wonder if part of the problem is the notion of "concept" and the idea that each box in the diagram needs its own identifier of the same "class". Put another way, I would have three "paths" or timelines, one for each type. Three "protonym" identifiers, one for each. Each identifier points to the entire history of each type, and events along the way are marked on those timelines. Each one of those events gets an identifier ("usage"). So you can still refer to A. bus s.l. or A. bus s.s. by referring to a given usage. Now, some of these paths will intersect, in the sense that someone may say that these two things are heterotypic synonyms, so the graph would need the option of having an edge between two paths (I think this is essentially what the Australian NSL does in their model).

I'm still waving my arms around here (can you tell? ;) ) but I do wonder if part of the problem is seeing things as boxes rather than as timelines.

from general.

rdmpage commented on May 29, 2024

@mdoering

> Usages are much more friendly to work with if you think of publications. Then they are immutable and you have a limited number of them per name. But for dynamic databases it becomes a different problem. The system has to decide when to create new identifiers. In this sense every monthly release of the CoL would be a different usage and has its own set of taxon ids. I doubt this is very useful to anyone. It is much like ALA and what CoL used to do. Identifiers change all the time. Names are actually more stable.

Argh, no! Firstly, why change everything, why not simply add new ids for those things that have been updated? In other words, you're doing a diff and saying these three records changed, everything else is the same. Secondly, these things would (if they are changes to existing records) simply be a new usage along the path for that "protonym". The protonym itself would be stable. Someone linking at that level of resolution (e.g., "I don't care about the details, it's Drosophila melanogaster as far as I'm concerned") wouldn't be affected.
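
The "do a diff" idea can be sketched as follows, assuming each usage record is reduced to a (protonym, accepted name, reference) tuple. Which properties belong in that tuple is exactly the open question in this thread, so the record shape, the id format, and the function name `carry_over_ids` are assumptions for illustration only.

```python
import itertools

def carry_over_ids(previous, current, id_counter):
    """Reuse usage ids for unchanged records; mint new ones only for changes.

    `previous` maps usage id -> (protonym, accepted name, reference).
    `current` is a list of such tuples from the new release.
    """
    known = {record: uid for uid, record in previous.items()}
    assigned, minted = {}, []
    for record in current:
        if record in known:
            assigned[known[record]] = record   # unchanged: keep the old id
        else:
            uid = f"u{next(id_counter)}"       # changed or new: mint an id
            assigned[uid] = record
            minted.append(uid)
    return assigned, minted

# Hypothetical releases: one record changes its accepted name and reference.
previous = {
    "u1": ("aus-fred-1909", "Xus aus", "Fred 1909"),
    "u2": ("bus-jones-1920", "Xus bus", "Jones 1920"),
    "u3": ("cus-smith-1930", "Xus cus", "Smith 1930"),
}
current = [
    ("aus-fred-1909", "Xus aus", "Fred 1909"),
    ("bus-jones-1920", "Xus bus", "Jones 1920"),
    ("cus-smith-1930", "Yus cus", "Lee 2020"),
]

assigned, minted = carry_over_ids(previous, current, itertools.count(4))
print(minted)
```

Only the changed record receives a new usage id; anyone linking at the protonym level ("cus-smith-1930") is unaffected either way.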

Put another way, we're talking about versioning, and it seems to me the only sensible way to deal with that is enable people who don't care about versions (which, I'd guess, will be 99% of people) to have something stable to point to (which is more or less what names have done for us), and give those who do care something they can point to.

from general.

mdoering commented on May 29, 2024

> I was imagining that, again, we could have a hierarchy of identifiers. You could just hang it on the name, or the protonym (in the sense of the complete set of usages of the name), or on a specific usage (equivalent to "sensu Fred").

Yes, we will have different ids for a name and a usage. But we want to keep the CoL usage ids stable across versions (currently monthly), so we need to figure out what's considered still the same taxonomic concept. For names it's simpler, but even there it is not obvious as it still requires a clear definition of a ScientificName.

The current version of CoL only generates stable ids for names.

from general.

mdoering commented on May 29, 2024

> Argh, no! Firstly, why change everything, why not simply add new ids for those things that have been updated? In other words, you're doing a diff and saying these three records changed, everything else is the same.

That raises again the question of which properties exactly belong to a usage. We are dealing with a graph where everything is connected. What are the boundaries of the object to version? Is the entire classification included or just the parentID as in our model? What about children? What about the structured reference of the publication? Does every single distribution record, vernacular name etc. have an impact too? If that's all in, you easily end up having new versions all the time.

> Secondly, these things would (if they are changes to existing records) simply be a new usage along the path for that "protonym". The protonym itself would be stable.

Are you sure the protonym is stable? It is in theory, and because it has been published on paper. But in a database world? See above for a usage's boundary. Adding a vernacular name could break it. I guess it would rather be the protonym's name id.

> Put another way, we're talking about versioning, and it seems to me the only sensible way to deal with that is enable people who don't care about versions (which, I'd guess, will be 99% of people) to have something stable to point to (which is more or less what names have done for us), and give those who do care something they can point to.

Yes

from general.

rdmpage commented on May 29, 2024

@mdoering I haven't kept up with CoL's data structure, but naively I would have said that the "concept" is the latest name + reference combination (e.g., A. bus + DOI:10.1234/xyz), and if there's not a more recent usage then the id for the latest "concept" would be unchanged. I put "concept" in quotes because it seems that everyone has a different idea of what that is. It's also not clear to me who the intended users are, and what their expectations would be. Clarifying that presumably would affect what identifiers to expose.

from general.

rdmpage commented on May 29, 2024

@mdoering

Argh, no! Firstly, why change everything, why not simply add new ids for those things that have been updated? In other words, you're doing a diff and saying these three records changed, everything else is the same. That raises again the question which properties exactly belong to a usage. We are dealing with a graph where everything is connected. What are the boundaries of the object to version? Is the entire classification included or just the parentID as in our model? What about children? What about the structured reference of the publication? Every single distribution record, vernacular name etc does have impact too? If thats all in, you easily end up to have new versions all the time.

I think this is a more general question about versioning graphs, and there’s a literature on that. Naively, I think in terms of edits between graphs, especially as this seems to capture the way taxonomists describe their work (e.g., “we created a new genus, and species x and y are transferred there” is essentially an edit script for transforming one graph into another). The other things you describe (publications, distribution records - really?) can either be treated as separate nodes, or as metadata (I gather there are ways to version graphs that treat node properties separately from nodes). But just because everything is connected doesn’t mean you can’t isolate changes in pretty much the same way you can do a diff on text to isolate the insert/delete/move events.
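To make the "edit script" idea concrete, here is a minimal sketch. It assumes each classification is reduced to a mapping from a stable node id to its parent id; the function name `edit_script` and the Aus/Xus data are hypothetical, and node properties (publications, distributions, vernacular names) are deliberately left out.

```python
def edit_script(old, new):
    """Return the edits that turn classification `old` into `new`.

    Both arguments map a stable node id to its parent id (None for a root).
    Only the parent/child structure is compared; node properties are ignored.
    """
    edits = []
    for node in sorted(new.keys() - old.keys()):
        edits.append(("insert", node, new[node]))
    for node in sorted(old.keys() - new.keys()):
        edits.append(("delete", node, old[node]))
    for node in sorted(old.keys() & new.keys()):
        if old[node] != new[node]:
            edits.append(("move", node, old[node], new[node]))
    return edits


# Hypothetical example: a new genus Xus is created and two species move there.
before = {"Aus": None, "bus": "Aus", "cus": "Aus", "dus": "Aus"}
after = {"Aus": None, "Xus": None, "bus": "Aus", "cus": "Xus", "dus": "Xus"}

script = edit_script(before, after)
for edit in script:
    print(edit)
```

The sketch only works because the node ids stay the same across both versions; if the ids were name strings that change with each move, the moves would show up as unrelated deletes and inserts.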

Secondly, these things would (if they are changes to existing records) simply be a new usage along the path for that "protonym". The protonym itself would be stable. Are you sure the protonym is stable? It is in the theory and because it has been published on paper. But in a database world? See above for a usages boundary. Adding a vernacular name could break it. I guess it would be rather the protonyms name id. Put another way, we're talking about versioning, and it seems to me the only sensible way to deal with that is enable people who don't care about versions (which, I'd guess, will be 99% of people) to have something stable to point to (which is more or less what names have done for us), and give those who do care something they can point to. Yes

I guess I imagine a list where the protonym is the head of the list, and you append usages (name + reference) of any name connected to the type to that list. This list, combined with lists for other protonyms will form a graph. Snapshots at a given time will be a classification. And again, I think you can separate node properties from nodes themselves. Yes, this is all a bit arm wavy, but I’ve built trees for different versions of the Clements Checklist of birds using eBird ids for species, and you can clearly isolate the changes by comparing the trees using a tree-based diff. Incidentally this is only possible because eBird keeps stable ids for species independently of the current name. I’m currently trying to do the same with the [reptile database](http://reptile-database.reptarium.cz) where, mercifully, there are internal integer ids for species that remain stable even if the name changes. I guess I see taxonomy as essentially a series of distributed edits on a graph, and if we capture those then we have basically captured what taxonomists generate in their work.
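A rough sketch of that list, assuming the publication year is good enough as a timestamp. The class names `Usage` and `ProtonymHistory`, the id `p1`, and the names and references are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Usage:
    name: str       # the name as used in the reference
    reference: str  # citation or DOI for the usage
    year: int       # publication year, used as the timestamp


@dataclass
class ProtonymHistory:
    protonym_id: str                            # stable id, never changes
    usages: list = field(default_factory=list)  # append-only list of Usage

    def append(self, usage):
        self.usages.append(usage)

    def name_at(self, year):
        """Most recent name in use at `year`, or None if nothing is published yet."""
        known = [u for u in self.usages if u.year <= year]
        return max(known, key=lambda u: u.year).name if known else None


def snapshot(histories, year):
    """Map each protonym id to the name in use at `year`."""
    result = {}
    for history in histories:
        name = history.name_at(year)
        if name is not None:
            result[history.protonym_id] = name
    return result


# Hypothetical data: one protonym, later moved to another genus.
bus = ProtonymHistory("p1")
bus.append(Usage("Aus bus", "hypothetical original description", 1758))
bus.append(Usage("Xus bus", "hypothetical revision", 1990))

print(snapshot([bus], 1980))
print(snapshot([bus], 2000))
```

The stable thing is `protonym_id`; the current name is just whatever the latest usage says at the time you ask.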

from general.

rdmpage commented on May 29, 2024

As an aside, here's a screenshot of a comparison between two recent classifications of snakes from the Reptile database. I use specific epithet plus internal integer id to identify each species (the id doesn't change for the species). This particular difference shows moving species from one genus to another (a move operation), taxa in light gray haven't changed. There's some added complexity that means the genus name itself has to change, but hopefully you get the idea. So I would regard things like "yunnanensis-18606" to be "protonym-style" identifiers that are linked to the complete fate of names attached to the type for that name, and for which you could recover that this species has moved from Sinonatrix to Trimerodytes (and probably other moves if we go back further in time).

Personally I would postpone any discussion of whether a taxon is the "same", because I don't think there's a unique answer (what does "same" mean?). But if you have the history then you enable people to determine "same" or not for their definition of "same".

What I also like is we can attach a publication to this particular edit operation (i.e., the research that led to the move), so it is linked to evidence, and to the people who did the work.

from general.

mdoering commented on May 29, 2024

Having a connected graph doesn't make things impossible, but it needs a definition or specification of what we want to version. Surely you can ignore vernacular names, distributions and the ancestral classification. We just need to agree on what is relevant.

And for CoL we want to provide one definition of a concept. Even if it is not universal and there will be legitimate reasons to have other ids, I strongly believe it is very useful to have a global taxonomy with some sort of stable taxon identifiers that people can hang things like identifications on as long as the definition of what the id refers to is clear to everyone. And thus when it changes or not.

Having a stable protonym anchor is nice, but it's too imprecise for many purposes (splits/merges).
And having a long list of usages again is inflationary as long as we do not have concept relations between them.
My goal is to provide identifiers for linking information that lie in between the two and are more stable than name based ones.

from general.

mjy commented on May 29, 2024

@mdoering where I'm going with this is that if you (or the CoL team) can't answer that question concretely/precisely then you will never get further towards answering this question.

Even if you can answer this question concretely I'm around 99% positive that you can not do what you want to do given the data you are given by the GSDs (no biological data, most without OTU ids). You have been tasked to do the impossible by the CoL. I'm serious. How, possibly, can you provide something more stable than the incoming data if those data don't have the requisite stability in the first place? This seems so obvious it's frustrating. We get data without the needed facts coming in, we mix it up, and VOILA BETTER FACTS. HUH!?

IMO @rdmpage is nailing all the key AHAs:

Don't think of the system as changing in the sense of editing one new node and turning that node into another, think of it as accumulating facts over time, always adding new nodes. Names don't become valid, or invalid. They are both, at different times. Names don't split or merge, they are used in different ways with reference to a citation.
As Rod alludes you can not, as I mention above, figure out sameness of a biological taxon, unless a) it is asserted with OTU ids by the provider (nobody does this) or b) you have an algorithm that computes across biological data. You certainly should not go there (again, what you specifically are tasked with for GSDs is an impossible task).
Nobody has implemented Franz' system to scale, which is what you'd need if you wanted to discuss biological merges etc. Most people use the proxy of cited history of protonyms with a 1:1 basis of biological entities. However for merging across different datasets this does not hold. I.e. within curated assertions by curator(s) they by proxy to the nomenclature relationships map to biological concepts, but outside that set of curated assertions that proxy doesn't hold. IMO Rod is absolutely correct that taking the step to manage merges (using Nico's system or others) is a HUGE amount of hard work that almost nobody will do unless we have radically new software, even then, unlikely.
Protonyms (monomials) need IDs, and in our world (TW) they are stable, and in the world you want they need to be stable IMO. This is not "arm wavy"- it's exactly what we do in TW. They need to be citable, and their combinations need to be citable, and assertions between protonyms need to be citable. Being citable means linking them to a timeline by proxy of the year of publication linked in the Citation. So too do OTU ids need to be stable. Without curation workbenches that do this for you, you are, like I mentioned above, up *@$ creek. Even if those workbenches do it for you you still have issues with merging across GSDs that I suspect can not be resolved without specific new assertions or fun biological computation.
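The "accumulate facts, never edit" principle can be sketched as an append-only log in which every assertion carries its citation and year. This is only an illustration under assumptions: it is not TaxonWorks code, and `FactLog`, the ids `p1`/`p2`, and the citations are hypothetical.

```python
from collections import namedtuple

Fact = namedtuple("Fact", "subject predicate obj citation year")


class FactLog:
    """Append-only store: facts are added, never updated or deleted."""

    def __init__(self):
        self._facts = []

    def assert_fact(self, subject, predicate, obj, citation, year):
        self._facts.append(Fact(subject, predicate, obj, citation, year))

    def as_of(self, subject, predicate, year):
        """Latest cited value for (subject, predicate) published by `year`."""
        matching = [
            f for f in self._facts
            if f.subject == subject and f.predicate == predicate and f.year <= year
        ]
        return max(matching, key=lambda f: f.year).obj if matching else None

    def __len__(self):
        return len(self._facts)


log = FactLog()
# Hypothetical assertions about protonym p1, each tied to a citation.
log.assert_fact("p1", "status", "valid", "hypothetical Smith 1960", 1960)
log.assert_fact("p1", "status", "synonym of p2", "hypothetical Jones 1985", 1985)

print(log.as_of("p1", "status", 1970))
print(log.as_of("p1", "status", 1990))
```

Both statements stay in the log, so the name is "valid" and "invalid" at different times, and nothing had to be versioned to get there.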

from general.

mjy commented on May 29, 2024

Perhaps this belongs in more of a blog format, but the coffee is flowing, so I'll post it here.

Reading back @rdmpage says " I was imagining that, again, we could have a hierarchy of identifiers.". I think we agree, but I look at it from a different angle. You don't even need a hierarchy bit (at the core), you just need an anonymous ID for stability. In WikiData the Q1234 is just fine, or maybe B12354 (where "B" is biological concept). Mint it, and surround it with facts. Any hierarchy can be added as an assertion, but it's not a central organizing principle. There, you have stability in as much as WikiData is stable, around which a concept can grow. The bonus- the WikiData identifier is resolvable, and the data there collapse down to computable statements. That concept has relationships to names, biological data, other "B"s. That's it. If people reference the Q12345, the concept will strengthen, if they cross-reference it to another system of identifiers, it will further strengthen. There is nothing magic here.

IMO things to avoid if you want to make it better:

Don't bother trying to enforce (or even espouse) one identifier per taxon concept. Plurality is reality.
However, attempt to bias the use of one identifier per concept, by doing good science. Build the strength of a concept (in the broad sense) by giving it rich context. With rich context comes eyes. With eyes come improvements to the data. With improvements to the data comes -> "That QID is good enough for me, I'll use it in my workbench, because it seems useful". Now that it's in my workbench (which also references many other identifiers the curator is interested in, but who cares), I've created a richer context. Iterative/cyclical improvements.
When a new Q concept emerges that seems to be a biological taxon quickly attach biological data to bias that concept to being thought of as biological, rather than some weird Frankenstein of biology and nomenclature.
Be OK with deprecating Qs, but do that at an external organization level. Imagine Q1 -> A gull. Q2 -> A gull almost identical to Q1, some think it is, some think it isn't. Both can exist, that's fine. An external org/agent etc. makes a decision to reference one or the other, or mint a 3rd if they want. When an agent/org mints a new list, they reference Qs, that reference builds a set of data that is returned to WikiData. Now we have richer context (everyone is using Q1, but almost nobody references Q2). We can ask why, etc., and refine or mint new Qs, or we can just drink the Kool-Aid and accept Q1, because everyone else is doing it, and we trust them. IMO the only way out of this approach and its problems is to compute on the facts (Qs attached to Qs that are specimens, not Qs that seem to be biological taxa).
Never, ever, ever embed information in the identifier, even prefix "B" (biological concept) over "Q" (thing) is likely a bad idea. Years in the ID? Terrible. Hierarchy or nestedness? Ugh. Relationships belong in object properties (links between instances). Identifiers with biological names in them are the absolute worst. People have to internalize that identifiers point to concepts which are the nucleus around which data accumulate, nothing more. Note that WikiData uses Q, and a couple other prefixes, that's it. That should be a big hint to those thinking about identifiers. This approach will only be learnt by teaching upcoming generations of students/workers.
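The points above can be sketched with a toy in-memory store. To be clear about assumptions: this is not the WikiData API, and `ConceptStore`, the property names, and the checklist ids are invented for illustration. The identifier carries no meaning; everything known about a concept is a statement attached to it.

```python
import itertools


class ConceptStore:
    """Toy store: opaque ids plus (subject, property, value) statements."""

    def __init__(self):
        self._counter = itertools.count(1)
        self.statements = []

    def mint(self):
        # The id is just a counter with a prefix; no name, year or rank inside.
        return f"Q{next(self._counter)}"

    def add(self, subject, prop, value):
        self.statements.append((subject, prop, value))

    def about(self, subject):
        return [(p, v) for s, p, v in self.statements if s == subject]

    def reference_count(self, concept_id):
        """How many external lists point at this concept."""
        return sum(
            1 for _, p, v in self.statements
            if p == "references" and v == concept_id
        )


store = ConceptStore()
q1 = store.mint()
q2 = store.mint()
store.add(q1, "label", "a gull")
store.add(q2, "label", "a gull, possibly the same as the first")

# Hypothetical checklists choosing which concept to reference.
store.add("checklist-A", "references", q1)
store.add("checklist-B", "references", q1)
store.add("checklist-C", "references", q2)

print(q1, store.reference_count(q1))
print(q2, store.reference_count(q2))
```

Both concepts coexist; which one "wins" is visible from how often it is referenced, not from anything encoded in the id.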

I love the idea of seeing WikiData IDs seep into all the nooks and crannies, they are so simple. We just have to build the practical interfaces to it such that curators/taxonomists/scientists can draw from those IDs, and integrate them into concepts they work with on a day-to-day basis.

Getting back to CoL. What could be done?

As a matter of policy, encourage, slowly, but ultimately more forcefully, GSD providers to provide OTU ids. The CoL is after all a list of biological species. If GSDs can't assert the circumscription of biological entities, but rather just nomenclatural relationships, they aren't doing their job.
As a matter of policy, encourage, slowly, but ultimately more forcefully, that those GSD OTU ids be WikiData Q numbers.
As a matter of policy, reject the concept that we will always have data from providers that do not have OTU ids. It is not OK for the CoL, IMO, to accept that some providers will just provide Word documents until the end of time. Make it an educational policy, with support from the CoL, to get the data out of those formats, and into one of the many possible better alternatives. I feel, to date, there is far too much complacency in this regard.

IMO any other effort by the CoL is treating the illness, rather than going for the cure.

from general.

mdoering commented on May 29, 2024

@mjy we want an algorithm that computes concept equality on the basis of stable name ids and the homo- and heterotypic synonymy given by a GSD. TW is very different in that it is an editorial system. Versioning is simple when you can intercept record based changes. But imagine every change is done by bulk uploading thousands of taxa and names. You need to figure out what has changed and if it's a relevant change unless you want to version each and every record all the time.

You can rely on stable ids from outside (WikiData, IPNI, Avibase, GSD IDs such as in WoRMS or TW, you name it) that the GSDs (re)use and then blindly trust them. But this is wishful thinking right now and we would have to drop large parts of the catalogue. The CoL is an established project that we need to continue. Even if we trusted ids from the outside they would not follow the same rules and be very different in what they mean. The CoL is an aggregation of heterogeneous sources.

That's why we decided to issue our own CoL ids (as CoL always did), based on some computable algorithm. The Taxon ID discussed here is something we have not started with, so details will only come up once we do so next year.

And really it's the same for name ids. Does any change to the record generate a new id or do we attach ids to the idea of a published name (usage) that is fixed, but for which we can change the name record's "metadata"?

from general.

mdoering commented on May 29, 2024

And like I said above: The basic version of such an algorithm would just look at the set of types included in the synonymy to define the concept. And in the absence of good type coverage the protonyms will be used as type proxies. Such ids might not be perfect, but have a clear definition, are stable and an improvement over pure name ids (which we also have as a different way to link to CoL).
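A minimal sketch of that basic version, assuming every name in the synonymy already has a stable protonym id and that the ids never contain the `|` separator. The function name `concept_key`, the `p:` ids and the 12-character truncation are illustrative choices, not anything CoL has specified.

```python
import hashlib


def concept_key(protonym_ids):
    """Derive a key from the set of protonyms included in a taxon's synonymy.

    Order, duplicates and the currently accepted name do not matter;
    only the membership of the set does.
    """
    canonical = "|".join(sorted(set(protonym_ids)))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()[:12]


# Hypothetical synonymies, using protonym ids as proxies for types.
release_1 = ["p:bus", "p:xus"]
release_2 = ["p:xus", "p:bus"]           # same set, listed in another order
release_3 = ["p:bus", "p:xus", "p:fus"]  # one more heterotypic synonym

print(concept_key(release_1) == concept_key(release_2))
print(concept_key(release_1) == concept_key(release_3))
```

Whether the key is a hash, a minted id looked up by the set, or the set itself is an implementation detail; what matters is that the rule for "same concept" is set equality over the included protonyms.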

from general.

mjy commented on May 29, 2024

@mdoering "then, blindly trust them. But this is wishful thinking right now" - So no identifier is good enough, so you'll mint your own, based on data that contains "identifiers/names" that are not good enough, and some algorithm that pulls new facts out of the air. Then, on top of that you are then asking others to trust your new identifiers and the decisions that come from them... but not those other ones. I see no problems there ;).

You're still thinking of "changes". There is no versioning, it's only accumulated facts, that's the principle TW uses. This is precisely the core of a data model CoL needs ultimately. How it populates that model is the real tricky bit (thus this issue). My argument, and I'll drop it, is that you can't get much farther than you do right now unless providers improve their data.

What do you mean by type? Specimens? Type specimens don't define biological concepts, that is an old, well known fallacy. Type specimens anchor name priority, that's it. It's a different edge in the model (Specimen -> Name, it has nothing to do with Specimen -> OTU/Biological concept). Overloading their meaning will lead to nothing but pain in the long run ;).

from general.

deepreef commented on May 29, 2024

All: I woke up this morning to an inbox full of really interesting and exciting posts within this thread. You all know me well enough to know that I cannot remain silent. So the only practical option for me was to read the thread in sequence, and comment accordingly. Apologies in advance for re-stating points already made (think of them as "+1"s), and for the length.

@rdmpage :

except I would junk 7 and 8.

That is basically the same conclusion I've come to over these past couple of years. It may eventually be possible to develop these areas (identifiers/classes for circumscriptions and broader "concepts"), and/or maybe other groups are better suited to pursue it than the usual gang of suspects (myself included) that keep repeating these conversations across many years. I think they have potential value, and I wouldn't shut the door on them completely. But I think we need to walk before we can run, and at the moment we're (still!) in the transition stage between crawling and walking. There have been a bunch of conversations along these lines in recent months among the tdwg/tnc group.

So, this leaves #5 and #6, namely "protonyms" and "usages" (I'm taking #1 - #4 as essentially given, maybe subject to tweaks).

Yup. Same here. Reaching back to the language I used in related comments, Protonyms are the content, and Usages are the context. Both are the same class of "Thing", because both have the exact same properties. However, distinguishing Protonyms (as a subset or subclass of all Usages; see tdwg/tnc discussion) is useful not because they represent a distinct "thing", but because they can serve in a special-case (and fundamentally important) kind of relationship with other Usages. This solves the issue raised in your recent iPhylo post. But please don't use the word "species" in this context (i.e., "...the importance of stable identifiers for species...", etc.). For every ten people who read your post, there will be 12 different ideas about what that word means in this context.

First, every protonym gets a nice, human-readable identifier, for example a combination of species epithet, author, and year.

Sure. Aus bus Linnaeus 1758. If you want to be really consistent and unambiguous and explicit, you would structure that identifier as Aus bus Linnaeus 1758 sec. Linnaeus 1758. There are pros and cons to qualifying protonyms that way, which I'd be happy to elaborate on in another post, if asked. In either case, we should call it a "canonical name-string" or something like that, so that it's immune to spelling variants, qualifiers, abbreviations, etc. that might have been represented on the actual page within Linnaeus 1758. But please, PLEASE don't assume that our electronic database systems will use these same human-friendly identifiers for internal identification purposes (e.g., foreign keys, or even urls). That would be a really bad mistake (see below).

Linked to this identifier is every homotypic synonym of that name ... This is essentially #5 (I think).

Yup. Exactly. See: 10.5281/zenodo.59790

Then imagine that same identifier is linked to every "usage" (name + reference pair) that we consider to be relevant, including heterotypic synonyms. This would enable a user to generate things like the current name and all synonyms, as well as go back and generate a snapshot of what the taxonomy was in, say, 1990. I think this is basically an aggregation of #6, and is close to the notion of a taxon concept being an "according to" statement.

WOW! FINALLY! Do you have any idea how long I've been waiting for someone else to write something like that? Seriously... THANK YOU!

One could imagine an interface (both web and API a bit like): ... /n/aus-fred-1909

Ugh. OK, well I can certainly imagine a service that takes those three parameters (epithet name, author, year) and finds how many matches there are. If only one match, it could function as an identifier and provide the relevant record. But based on content already in GNUB (202K Protonyms initially established as full species), about 7,000 (~3.4%) are non-unique across these three property values (original epithet orthography, authorship string, year). Granted, that's a small percentage -- but even 96.6% unique is pretty pathetic in the realm of "unique identifiers". (Fun fact: the author Malm described 24 different species with the name "linnei" in 1877; per ZooBank).

As I've pointed out many times, the amount of complexity needed to come up with an identifier for this sort of thing that is both human-friendly and unique vastly exceeds the complexity of having opaque identifiers (e.g., UUIDs) that are used by the computer for true identification, and then simply renders the results back to humans with a human-friendly label.
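A small sketch of that split between identification and labelling. The registry, the function names and the two genus names are hypothetical stand-ins (the real genera of Malm's "linnei" names are not given here); only `uuid.uuid4()` from the standard library does the identifying.

```python
import uuid

records = {}  # opaque id -> record; the id is what software should store


def register(epithet, author, year, original_genus):
    """Create a record under a freshly minted opaque identifier."""
    record_id = str(uuid.uuid4())
    records[record_id] = {
        "epithet": epithet,
        "author": author,
        "year": year,
        "original_genus": original_genus,
    }
    return record_id


def label(record_id):
    """Human-friendly label, rendered on demand and not guaranteed unique."""
    r = records[record_id]
    return f"{r['epithet']} {r['author']} {r['year']}"


def lookup(epithet, author, year):
    """All ids matching the three 'human-friendly' fields."""
    return [
        rid for rid, r in records.items()
        if (r["epithet"], r["author"], r["year"]) == (epithet, author, year)
    ]


# Two of the same-author, same-year "linnei" names, with placeholder genera.
a = register("linnei", "Malm", 1877, "Aus")
b = register("linnei", "Malm", 1877, "Xus")

print(label(a), "|", label(b))
print(len(lookup("linnei", "Malm", 1877)))
```

The epithet/author/year triple works fine as a search that may return several hits; it just cannot be the key.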

But that aside, yes -- we've already built and tested services of the sort you described. But the funding ran out before we were in a position to turn them into accessible APIs. That circumstance is changing (rapidly), so we may get these APIs up and running after all. Watch this space.

Everything else (actual "content" of each taxon, implications for characters of taxa, etc.) are all things one could compute from the classification if you wanted, but I think these are really separate things.

I absolutely, 100% agree!

If, for example, the identifiers were DOIs, clean and human readable

I know you love human-friendly identifiers, and I get that. But life is SO much easier if you have computer-friendly identifiers, then represent them via human-friendly labels whenever human eyeballs are in play. DOIs are WONDERFUL because of the rich dereferencing/resolution services. But they suffer the same fate as PURLs and other similar sorts of identifiers in that they conflate identification with dereferencing/resolution mechanisms. The best of all worlds can be achieved when you mint UUIDs as identifiers, then wrap them in a DOI prefix (making them dereferencable/resolvable), and then create a standard format for constructing a human-friendly label. The PLAZI/Zenodo team almost gets it right, in that they issue UUIDs to Usages (=Treatments), then Zenodo mints DOIs for them. Unfortunately, Zenodo doesn't embed the UUID within the DOI, so we have yet another identifier to track. For example: http://treatment.plazi.org/id/03EA878F-FF95-FFA5-4F81-1B00FB0E6CA9 sameAs http://doi.org/10.5281/zenodo.3806768

Sigh....so close....

@mdoering :

There are 5 concepts in those 6 usages in the diagram which I would really like to attach 5 different ids to.

I believe I recognize the handwriting/chicken-scratch in the whiteboard diagram as my own (and I certainly remember the animated discussion). The problem is not that we don't have enough identifiers. The problem is that we have too many. All of the boxes on that diagram get a separate TNU identifier. The problem is that there are dozens/hundreds/thousands of potentially relevant other TNUs for congruent circumscriptions, each with their own identifiers, and it's not clear which one to use. For example, there may be many publications that all represent the same set of heterotypic protonyms as shown in box 5 of the whiteboard diagram. Which one becomes the "identifier" for that box? The chronologically first to synonymize A. xus under A. bus? The one who provided the most robust taxonomic treatment? What people seem to want is a single identifier for "Box 5", which may serve as hub for dozens of individual TNUs all asserting the same/congruent taxon. This is where "TNU as surrogate for concept/circumscription" gets messy, and requires a third party to "elevate" one of the TNUs as the surrogate/proxy for the "box". But in any case, the problem isn't a lack of identifiers to use for concepts. The problem is an overabundance of them.

Also the main problem here for me is how do I know when looking at the data that the concept has changed? Ignoring identifiers completely as we want to (re)assign them in the process. What warrants a taxon change and when does it remain the same? We need to find objective rules what the algorithm has to do. It also gives a real meaning to CoL ids as they are based on objective rules.

Here is where I think we keep getting hung up. In order for "a concept" to "change", we need to come to some agreement as to what "a concept" is. How can you know whether it has "changed" if you don't even agree on what it is? Walter Berendsohn used the term "Potential Taxon", for what I called "Assertion", and which we now refer to as [Taxonomic Name] Usages. Every TNU represents a potentially different taxon (concept/circumscription). But depending on how one defines "taxon" (i.e., my #7, which both @rdmpage and I have decided is not tractable - at least not at this time), different people would use different mappings of which individual TNU instances map to which individual "taxa". So to say that "a concept" has "changed", we first need a definition for what "a concept" is, and even after we achieve that, it's often the case that insufficient information exists (within the publications, within our databases) to even know if the concept has changed. In theory, this would be wonderful. In practice, it's going to be a while before it can be meaningfully implemented. I think @nfranz understands this realm far better than anyone else, so I would defer to him on that point -- but the sort of stuff he has done explores the potential/power/limitations of this space. Personally, I find it both exciting and scary at the same time.

Reading further down the thread, I think @rdmpage nailed it with:

I think "when does it change and when is it the same?" leads to madness.

He also nailed it with this:

Every usage gets an identifier, so the question to ask is not "is this the same as that?" it's what usage (if any) do I want to refer to?

+1 (is it possible to add a "+5"?)

@mjy:

Link your biological data to OTUs (anonymous entities linked to nomenclature)

I would say "Link your biological data to TNUs" (each of which represents an explicitly defined or implicit OTU). Are we saying essentially the same thing? The nice thing about doing it through TNUs is that's often how it happens in the real world. Someone has an organism in-hand (biological data), and assigns it to a name by referring to some (usually published) definition of the name (field guide, key, etc.). The exceptions are the expert taxonomists who just "know" what species it is. But in such cases, they simply need to point to a TNU that represents the taxon in the same way they "know" it to be.

Stack citations, however many you want on any concept (e.g. OTU, Protonym, Franz graph relationship, relationship between OTUs, relationship between Protonyms, etc.). This is your timestamp proxy.

OK, so maybe we're not the same. I've recently had very long discussions with Kevin Thiele about exactly this issue (we even refer to it as "stacks" of TNUs aligned on a single "concept"/"circumscription" instance). But see my comment to @mdoering above: coming up with a shared definition for what these name-less taxon entities are, is the real barrier.

Flesh-and-blood-and-cellulose-and-cytoplasm Organisms exist in nature. Taxa do not. Taxa exist in the minds of humans. Humans communicate information about taxa (and the mappings between their imagined taxa and actual organisms) via text-string names usually embedded within publications (or other references). The text-string names are usually what get indexed in databases. But the name-in-context (e.g., "Aus bus Linnaeus 1758 sensu Pyle 2020"; AKA a TNU) is the most effective and practical way to reference the interface between names, organisms and OTUs/taxa.

For what it's worth we have 100s of thousands of taxon names, OTUs, specimens, citations, and identifiers following this approach in TaxonWorks, i.e. it's not an imagined approach.

Substitute "GNUB" for "TaxonWorks", and I can make exactly the same assertion (and more than just specimens -- in fact, most of the organism occurrence instances are observations).

Back to @mdoering:

Usages are much more friendly to work with if you think of publications. Then they are immutable and you have a limited number of them per name. But for dynamic databases it becomes a different problem.

Yes, which is why I separate out the static TNUs from the dynamic Meta-Authority assertions. See, again, this publication, page 34, starting with the heading "Accepted status".

The system has to decide when to create new identifiers. In this sense every monthly release of the CoL would be a different usage and has its own set of taxon ids.

Not necessarily. Even if you can't stomach the Meta-Authority approach (where a new identifier is needed only when a particular perspective changes), you can just only issue a new identifier when it changes in substance (different synonymy, different classification, change in circumscription, etc.; more detail below) from one month to the next. Effectively each month's cut becomes a change log. The cut can include the full dataset, but the identifiers only change when the relevant content changes. You still need to define what properties within CoL warrant a new identifier; but I would suggest that you only change the identifier when the classification changes (including placement of a species epithet in a different genus), or when the set of heterotypic synonyms changes. If you try to get more granular than that, I think you'll be on the path to madness that @rdmpage alluded to.
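A sketch of that "only change the identifier when it changes in substance" rule, under the assumption that taxa can be matched between releases by the protonym id of the accepted name. The record layout, the `T` ids and the function names are hypothetical; the choice of parent plus heterotypic synonym set as the relevant properties follows the suggestion above.

```python
import itertools


def relevant(record):
    """The properties whose change warrants a new taxon id."""
    return (record["parent"], frozenset(record["synonyms"]))


def carry_ids(prev_release, prev_ids, new_release, mint):
    """Reuse ids for unchanged taxa; mint new ones for changed or new taxa."""
    new_ids = {}
    for key, record in new_release.items():
        old = prev_release.get(key)
        if old is not None and relevant(old) == relevant(record):
            new_ids[key] = prev_ids[key]
        else:
            new_ids[key] = mint()
    return new_ids


# Hypothetical releases keyed by the protonym id of the accepted name.
january = {
    "p:bus": {"parent": "Aus", "synonyms": {"p:xus"}, "vernacular": []},
    "p:cus": {"parent": "Aus", "synonyms": set(), "vernacular": []},
}
january_ids = {"p:bus": "T1", "p:cus": "T2"}

february = {
    # Only a vernacular name was added: not a relevant change.
    "p:bus": {"parent": "Aus", "synonyms": {"p:xus"}, "vernacular": ["bus fish"]},
    # Moved to another genus: relevant under this rule.
    "p:cus": {"parent": "Xus", "synonyms": set(), "vernacular": []},
}

counter = itertools.count(3)
february_ids = carry_ids(january, january_ids, february, lambda: f"T{next(counter)}")
print(february_ids)
```

Tightening or loosening `relevant` is where the policy decision lives; the rest is bookkeeping.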

CRAP! I just got to the post from @rdmpage that includes:

Firstly, why change everything, why not simply add new ids for those things that have been updated? In other words, you're doing a diff and saying these three records changed, everything else is the same.

OK, replace all of my paragraph above with "+1" on that post from @rdmpage . I could have deleted it, but what the hell -- maybe it says the same thing in a slightly different way.

My goal is to start with the type specimen as the anchor for a taxon, but refine that for splits and merges by comparing adjacent taxa and their types.

Yes, this comes back to the conversation we had in the living room of @dremsen. Use heterotypic synonymy sets as your computable mapping to when a new identifier is needed (i.e., protonyms as proxies for type specimens). This is imperfect, of course, when you don't have heterotypic synonyms listed, or when you need to divine the relationship between an earlier treatment and a later treatment (in the diagram, Aus bus sec. 1960 to Aus bus sec. 1970). But honestly -- without a @nfranz -style analysis (which itself is still ultimately subjective), you can't ever know whether Aus bus sec. 1960 maps to Aus bus sec. 1970; or maps to [Aus bus sec. 1970 + Aus fus sec. 1970]. In other words, you can't know from the data we generally have at our easy disposal whether Aus bus sec. 1960 was "split" into Aus bus sec. 1970 + Aus fus sec. 1970, or whether Aus bus sec. 1960 is congruent to Aus bus sec. 1970. Someday, when the @nfranz approach has been fleshed out across all of taxonomy, then these sorts of questions will be computable. But until then, it's probably best not to go down that rabbit hole.

Ooops!! I just now read the next post:

Well, the base for the taxon is the set of types.
[etc.]

I almost deleted the stuff I wrote above as redundant, but you can instead just treat it as a "+1".

@dremsen:

This is the direction I also favor.

I hope so! It was your living room, after all! :)

As to the set of posts related to this from @mjy:

Can anyone answer whether a taxon concept sensu the title of this issue has biological attributes?

@mdoering already answered exactly the same way I would, so I'll simply say +1 to his reply.

@rdmpage:

I'm still waving my arms around here (can you tell? ;) ) but I do wonder if part of the problem is seeing things as boxes rather than as timelines.

Another "WOW!" (+5). My arms are downright exhausted from years of waving in the exact same way. So, I already addressed this a bit above, but if those boxes represent specific/individual TNU instances, then I'm 100% onboard. If they represent abstract notions of name-independent taxa into which stacks of TNUs are folded, then I start to get a bit more dizzy. Again, I think the "set of heterotypic synonyms using protonym identifiers as proxies for type specimens" approach is (by far) the best path forward. Yes, some of the s.s. vs. s.l. distinctions will fall through the cracks, but those can be addressed later when we all catch up to @nfranz on this stuff. Whether we need to mint singular identifiers (of a different class) to represent sets of ProtonymIDs (vs. simply using the array of heterotypically synonymous ProtonymIDs as itself the mechanism for uniquely identifying the boxes) is, I think, an implementation question. I'd only advise exercising caution before dumping a new class of identifiers on the world, because you know it will be badly misunderstood and misused by the masses.

Back to @mdoering:

Yes, we will have different ids for a name and a usage.

If by this you mean "different classes of identifiers", PLEASE consider this carefully. I went through a LOT of painful mistakes when I did the same thing back in the 1990s; and when I saw the light that Protonyms are a subset of TNUs, the implementation side got MUCH easier.

Protonyms ARE TNUs; they're just a special subclass of TNUs. They have the same properties as TNUs. In 99% of cases, the Protonym is of the form "Aus bus Linnaeus 1758 sec. Linnaeus 1758" (there are exceptions, but mostly confined to old names that were first established in a non-Code-compliant way, then made available later -- this is something that should remain within the realm of nomenclators).

If you start minting different identifiers for the "Protonym" of Aus bus Linnaeus 1758, separate from the "Usage" Aus bus Linnaeus 1758 sec. Linnaeus 1758, you will almost certainly regret it. At first glance it seems like the same identifier means different things depending on whether you're referring to the Protonym of the name "bus", or the taxon concept asserted by Linnaeus in Aus bus Linnaeus 1758 sec. Linnaeus 1758; but I promise that this distinction is just an illusion. It would require more text than I've already written above to explain why this is so. But I can share some of the LONG emails I had with Kevin Thiele, if you want.

That raises again the question which properties exactly belong to a usage.

Just to continue and expand from what I already wrote above, I have been using these four properties to represent a "change" in an objectively identifiable way:

Classification (i.e., immediate hierarchical parent; not the full hierarchy to the top)
Set of ProtonymIDs representing heterotypic synonyms
Rank (e.g., full species vs. subspecies) -- this is essentially redundant to #1, but not always (e.g., when you go from Aus bus subsp. cus to Aus bus var. cus)
Orthography (exact literal UTF-8 representation of the epithet only; not the combination)
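As a rough illustration of how those four properties could be compared mechanically, here is a minimal Python sketch. The `Usage` record, its field names, and the identifier strings are all hypothetical and are not taken from GNUB or CoL; the point is only that each property can be checked by simple equality.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Usage:
    # All field names are illustrative assumptions, not a real schema.
    parent_id: str                # immediate hierarchical parent only
    protonym_ids: FrozenSet[str]  # heterotypic synonyms as protonym identifiers
    rank: str                     # e.g. "species", "subspecies", "variety"
    epithet: str                  # exact literal spelling of the epithet only


def changed_properties(old: Usage, new: Usage) -> List[str]:
    """Return the names of the tracked properties that differ."""
    fields = ("parent_id", "protonym_ids", "rank", "epithet")
    return [f for f in fields if getattr(old, f) != getattr(new, f)]


# Hypothetical example: a second protonym is added to the synonymy.
old = Usage("genus:Aus", frozenset({"p:bus"}), "species", "bus")
new = Usage("genus:Aus", frozenset({"p:bus", "p:xus"}), "species", "bus")
print(changed_properties(old, new))
```

Whether any non-empty result should trigger a new identifier, or only some of the four, is a policy decision the sketch leaves open.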

You could also add:

Reference/TNU used as an anchor point/basis -- such as when a new publication comes along that doesn't change any of the four properties above, but provides a much more robust diagnosis/etc. and thus represents a "meatier" foundation. But for computational purposes, this doesn't really add anything. For end-users, it might (and that would also bring it a step closer to the Meta-Authority model).

On the whole "versioning" thing, I think the immediate/important questions most people want to answer are:

What is the status right now from the perspective of my favorite/trusted Meta-Authority (e.g., CoL)?
What are the various perspectives in the literature for a given Protonym over its past history (including the alternative "current" treatments/views that differ from my favorite/trusted Meta-Authority)?

I think most people are a lot less concerned with "What is the history of how my favorite/trusted Meta-Authority has changed its views over time?" Sure, that information should be tracked, and is interesting in some contexts, but it seems more of an implementation thing. The "versioning" approach is one way to do it, but that requires new identifiers. The way GNUB handles it is with a robust audit trail (literally every change of every field in every record is logged with a timestamp and responsible party, so there is no "version" per se, just a timestamped change log for each record).
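To make the contrast with versioning concrete, a field-level audit trail might look roughly like the sketch below. This is an assumption-laden illustration, not GNUB's actual schema: `AuditedStore`, `Change`, and the field names are invented for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Dict, List


@dataclass(frozen=True)
class Change:
    record_id: str
    field: str
    old_value: Any
    new_value: Any
    changed_by: str
    changed_at: datetime


class AuditedStore:
    """Hypothetical store: identifiers stay fixed, every field edit is logged."""

    def __init__(self) -> None:
        self.records: Dict[str, Dict[str, Any]] = {}
        self.log: List[Change] = []

    def set_field(self, record_id: str, field: str, value: Any, user: str) -> None:
        record = self.records.setdefault(record_id, {})
        old = record.get(field)
        if old == value:
            return  # nothing changed, so nothing is logged
        record[field] = value
        self.log.append(
            Change(record_id, field, old, value, user, datetime.now(timezone.utc))
        )

    def history(self, record_id: str) -> List[Change]:
        return [c for c in self.log if c.record_id == record_id]


store = AuditedStore()
store.set_field("tnu:1", "parent_id", "genus:Aus", "alice")
store.set_field("tnu:1", "parent_id", "genus:Aus", "alice")  # no-op
store.set_field("tnu:1", "parent_id", "genus:Xus", "bob")
history = store.history("tnu:1")
```

The record identifier never changes here; the state at any past moment can be reconstructed by replaying the log up to that timestamp.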

@mjy :

You have been tasked to do the impossible by the CoL.

In some senses I agree, but there is a really, really, really simple thing that CoL can at least encourage GSDs to do, and implement itself when the content exists (e.g., content through WoRMS and other robust GSDs), which is simply to track one more piece of information for each record: "Reference we follow in making our assertion about current status". In other words, the bit after the "sensu". If you can just get that much information, it would be a quantum leap in the utility of the data CoL provides. And even if only a minority of content providers can offer this information, you can always skip that step with a place-holder "sensu somebody but we're not sure who" approach, so at least the operational data model is functioning at the TNU level, not just the Protonym (or vague "name") level.

A big "+1" on all the rest of what was included in this post from @mjy (as well as several "+3"s and "+5"s!)

Also, LOTS of "+1"s, "+3"s and "+5"s (especially "Never, ever, ever embed information in the identifier...") in your follow-up pseudo-blog post.

As a matter of policy, encourage, slowly, but ultimately more forcefully, GSD providers to provide OTU ids.

I'm not sure it's the same, but I've been pushing hard (including above) for CoL to get the GSDs to provide a reference anchor point for each asserted "current status". We should move beyond the approach of "sensu GSD Year", and move towards "sensu Publication". Most GSDs are not practicing actual taxonomy within their databases; rather their databases usually serve as value-added indexes of what's happening in the literature.

that those GSD OTU ids be WikiData Q numbers.

Meh... I'm not sure that's the right choice. But I may be an outlier in that.

What do you mean by type? Specimens? Type specimens don't define biological concepts,

Individual type specimens don't, but sets of types (as proxied through ProtonymIDs expressed as a heterotypic synonymy) most certainly do! I was at a meeting held at Smithsonian back in the 1990s, where this basic topic of discussion was focused in the context of FGDC Metadata Standards (of all things). Walter Berendsohn, Stan Blum, Bob Peet, and a few of the other early workers in this space were there. I outlined different levels of granularity with which one could define the boundaries of a taxon concept/circumscription:

Sets of individual organisms (e.g., explicit material examined)
Individual populations (usually proxied by geographic distributions)
Sets of individual characters (morphological and/or molecular characteristics)
Sets of type specimens, including among a heterotypic synonymy, as proxied by Protonyms (I hadn't yet coined that word in this context, but that's what I meant)

The last of these is obviously the least granular, and some might argue (therefore) the least useful. But in the 2+ decades since then, it has become more and more obvious to me that defining taxon circumscription boundaries through sets of type specimens (proxied by ProtonymIDs, as included in an asserted heterotypic synonymy) is the most workable approach. As my wife once said, "It's better to be vaguely correct than precisely wrong". And sets of heterotypic synonyms (as proxies for their corresponding type specimens), while vague, are almost purely objective in nature, and as such are in the realm of "facts" (I strongly support the point by @mjy about assembling and growing sets of objective facts). Also, one can never enumerate, extrinsically, all of the individual organisms (recently dead, still alive, and yet to be born); so there is always an implied non-explicitly-enumerated set of organisms that should be included within the circumscription. I've also never been a fan of the character-based approach, because you always get the odd mutant individual that happens to lack some key diagnostic character which, technically, would fall outside the circumscription (even if both its parents fell within).

Even if no heterotypic synonyms are provided, you can still infer the scope of the circumscription as inclusive of all organisms up to but not including the most recent common ancestor of the nearest relative/protonym/type specimen that I regard as *not* within the circumscription (i.e., the other related taxa recognized as valid). For those of us who are OK with paraphyletic taxa, it's a little more complex (but not much).

Anyway, this same basic idea was fleshed out in even more detail with @mdoering and @dremsen in the latter's living room (same gathering that produced the whiteboard image posted at the top of this thread). We were close then, and we're still close now. I keep participating in these conversations (as well as the ones happening in parallel in the tdwg/tnc group, and elsewhere), because I keep hoping that maybe "this time" we'll actually have a breakthrough and reach consensus. I had almost given up all hope, but I have to say that both this thread, and the direction happening over at tdwg/tnc, has boosted my optimism that maybe -- maybe -- we're getting close to consensus on some of this stuff!

Phew, that diatribe took me from breakfast all the way to lunch! Again, sorry for the long post, but there was a lot to cover from what y'all wrote while I slept.

P.S. If I didn't quote/comment on the above, then you can pretty safely assume that I'm a "+1" on the rest of the comments in this thread.


rdmpage commented on May 29, 2024

Lots to think about here, and I've some reading to do. As a side note I wanted to comment on identifiers. There are bigger hills to die on, and I know I was just begging to be slapped for bringing up uninomials as identifiers - see also comments on Taxonomic concepts: a possible way forward, - but a few thoughts (and I don't want to derail broader discussion, feel free to completely ignore this).

By "hierarchical" identifiers I had in mind the notion of URLs as API, that is, how would someone query the data, and couldn't those queries be expressed as URLs that also serve as identifiers? This leads to a clean interface that gives people the answers they are looking for, and a way to automatically cite the identifier for that information.
I also wanted a way to emphasise that I don't think all the concepts being discussed are the same thing. For example, the whiteboard diagram could be interpreted as six things that are all of the same type, whereas I see three paths (graphs) with some points (nodes) along the way. It seemed easier to make that case if I used identifiers that explicitly identify wholes and their parts. Bit like having identifiers for journals, journal issues, articles, and parts of articles.
I'm not particularly wedded to the notion of uninomials (or some variation on them), my motivation there was to have something that is human readable and familiar (for example, so I can use them when doing diffs between trees and quickly understand what is going on). Despite what people may think, I suspect having short, friendly identifiers matters when trying to sell the idea to people, and it also means we can draw on earlier discussions in the field where people have confronted the problem of identifiers for species. There's a literature that goes back at least to the 1930s, and has been revived in the last few decades in the context of the phylocode. In other words, when presenting these ideas to other taxonomists we can say "look, this is an issue that our field has known about for a long time, it's not just the ramblings of a few computer obsessed geeks trying to make your life difficult".
I think the notion of opaque identifiers is often misunderstood. It's not that identifiers shouldn't contain information, it's that a consumer shouldn't expect to be able to interpret that information reliably. In other words, if I have an identifier such as a SICI or a DOI that contains an ISSN, it is likely that the ISSN is the ISSN of the journal containing that article, but it might have since changed. If I have an identifier that contains an integer n, it's likely that an identifier with n + 1 is more recent (e.g., Wikidata), but this need not be the case. It's not an injunction to not embed information, it's a warning not to interpret the identifier as informative.
I think some have interpreted the notion of opaque identifiers as grounds for having obfuscated identifiers, such as UUIDs. In other words, let's make damn sure people can't interpret the identifier (and there may be good reasons for that). I think arguments that identifiers are only designed to be read by machines not by people miss the point - in order for identifiers to be useful they have to be adopted by people, be they developers, users, etc. Identifiers such as DOIs have gained widespread acceptance partly because they are highly visible, and in most cases pretty easy to read. You just have to look at the number of times publications break identifiers embedded in text (e.g., UUID based LSIDs, long DOIs) by inserting line break characters in the middle to realise that the choice of identifier syntax matters.

I guess I'm arguing that it is easy to be dogmatic and say that:

Identifiers should always be opaque
Identifiers should only be designed for machines not people
Identifiers shouldn't be hierarchical

but I think things are more nuanced than that.

Anyway, back to reading the stuff that matters...


mdoering commented on May 29, 2024

Let me just remind everyone that this issue is about what defines a taxon concept in the CoL. The definition of a unique taxon concept in the CoL defines what algorithm we need to compute equality and thus stable ids.

The problem is not that we don't have enough identifiers. The problem is that we have too many. All of the boxes on that diagram get a separate TNU identifier. The problem is that there are dozens/hundreds/thousands of potentially relevant other TNUs for congruent circumscriptions, each with their own identifiers, and it's not clear which one to use. For example, there may be many publications that all represent the same set of heterotypic protonyms as shown in box 5 of the whiteboard diagram. Which one becomes the "identifier" for that box? The chronologically first to synonymize A. xus under A. bus? The one who provided the most robust taxonomic treatment? What people seem to want is a single identifier for "Box 5", which may serve as hub for dozens of individual TNUs all asserting the same/congruent taxon. This is where "TNU as surrogate for concept/circumscription" gets messy, and requires a third party to "elevate" one of the TNUs as the surrogate/proxy for the "box". But in any case, the problem isn't a lack of identifiers to use for concepts. The problem is an overabundance of them.

Surely there are many ids and even more usages out there. But that is not what the CoL is about.
Our main problem is comparing previous editions of the CoL with the latest version about to be released and assessing under which id to release each taxon.

The other use case is the Clearinghouse, where we keep many external "checklist" datasets that can act as a source for the CoL, but don't have to. These lists (mostly taxonomic trees) come with their own usage ids and we retain them (in contrast to GBIF ChecklistBank where new integer ids are issued). In order to navigate across datasets we have a names index that allows us to find the same name across datasets, even, for example, if the authorship was spelled slightly differently. Similarly we want to establish a taxon concept index that can be used to find equal concepts across datasets without requiring them to use the same accepted name. I am well aware there are many definitions for both a unique name and taxon concept. For very valid reasons. But for our implementation we need to select one definition that can be used to set up the names and concept index.

As said before, as a starter we will probably try to use the set of protonyms to build the taxon concept index. We are not trying to perfectly model the world of taxonomy and publications. We need something workable in a reasonable amount of time.
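One way to picture a protonym-set concept index is sketched below in Python. It is only an illustration under stated assumptions: the row shape, the dataset keys, and the `p:` identifiers are hypothetical, and the real index would need to cope with missing or partial synonymies.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List, Tuple

# Hypothetical row shape: (dataset_key, usage_id, accepted_name, protonym_ids)
UsageRow = Tuple[str, str, str, Iterable[str]]
ConceptIndex = Dict[FrozenSet[str], List[Tuple[str, str, str]]]


def build_concept_index(rows: Iterable[UsageRow]) -> ConceptIndex:
    """Group usages from any dataset by their set of protonym identifiers."""
    index = defaultdict(list)
    for dataset_key, usage_id, accepted_name, protonym_ids in rows:
        # A frozenset ignores order, so listing synonyms differently still matches.
        index[frozenset(protonym_ids)].append((dataset_key, usage_id, accepted_name))
    return dict(index)


rows = [
    ("datasetA", "a1", "Aus bus", ["p:bus", "p:xus"]),
    ("datasetB", "b7", "Xus bus", ["p:xus", "p:bus"]),  # same set, other accepted name
    ("datasetB", "b8", "Aus cus", ["p:cus"]),
]
index = build_concept_index(rows)
for key, usages in index.items():
    print(sorted(key), usages)
```

In this toy data the usages `a1` and `b7` land in the same bucket even though they carry different accepted names, which is the cross-dataset lookup described above.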

As for the style of identifiers we want to use see CatalogueOfLife/backend#491


mdoering commented on May 29, 2024

every usage gets an identifier, so the question to ask is not "is this the same as that?" it's what usage (if any) do I want to refer to?

+1 (is it possible to add a "+5"?)

In that case @rdmpage should be happy about CoL and ALA issuing new identifiers all the time in every release. But most people including Rod seem to hate that.


mdoering commented on May 29, 2024

Yes, we will have different ids for a name and a usage.

If by this you mean "different classes of identifiers", PLEASE consider this carefully. I went through a LOT of painful mistakes when I did the same thing back in the 1990s; and when I saw the light that Protonyms are a subset of TNUs, the implementation side got MUCH easier.

That is one thing I would like to roll back if I could start again. Separating names and usages seems more of an idealistic thing. So far I do not see any benefits over just having NameUsage instances that have joined properties. And the implementation got way more complex with having names and usages separated.


rdmpage commented on May 29, 2024

@mdoering

every usage gets an identifier, so the question to ask is not "is this the same as that?" it's what usage (if any) do I want to refer to?
+1 (is it possible to add a "+5"?)

In that case @rdmpage should be happy about CoL and ALA issuing new identifiers all the time in every release. But most people including Rod seem to hate that.

Because I think the identifier most people will want is a set of usages, not any particular one. A bit like this thread, I can point to an individual comment #6 (comment) or the whole thread #6. My view is that in most cases, the whole thread ("taxon") is what people will refer to, they'll refer to a comment ("usage") if they feel the need for that level of specificity.

I think this is why people like to link to names, they have enough specificity (that name) and yet enough slop (all mentions of that name). I think ideally taxon ids would have similar attributes, perhaps with more resilience as they needn't change with changes in name. Otherwise there is limited incentive to link to them (a lot of the work I did in 2018 to link to ALA is now broken because ALA doesn't value identifier stability as much as I do).


rdmpage commented on May 29, 2024

@mdoering

Let me just remind everyone that this issue is about what defines a taxon concept in the CoL. The definition of a unique taxon concept in the CoL defines what algorithm we need to compute equality and thus stable ids.

OK, we've had our fun now. Apologies for hijacking this thread.

Regarding the specific issue you ask about, can I suggest framing it slightly differently? Presumably you have a classification already (CoL-now). Based on aggregating the data, you have a new classification (CoL-future) that you want to release. You want to assign identifiers to taxa in that new release.

For example, currently you have for Opisthotropis balteata (Cope, 1895) the id
http://www.catalogueoflife.org/col/details/species/id/e5b7c4081a35d451a9c187e327793765 based on the Reptile database for 2015-12-15. When you ingest the latest Reptile checklist you'll find this is now in the genus Trimerodytes. I would retain the current identifier e5b7c4081a35d451a9c187e327793765 despite the name change - it's moved genera, but in some sense is still the same thing (for various definitions of "same", other definitions are available). Likewise, in most cases like this I would NOT change the id for the genus even if it gains or loses species; as far as the edit script is concerned those nodes don't change.

So, in practical terms, I would do a tree diff between the two classifications to find the minimum number of edits required to convert one tree into another (deletes, inserts, moves). Inserts are easy, that's a new taxon, that's a new id. Moves are typically species from one genus to another, I would retain the same id. Deletes are easy, they no longer exist (kidding). Deletes are likely to be newly synonymised names, but I think a way to handle that is to have the synonym as a child of the accepted name (I think you've done this before when I talked about tree edits a while back).

Now I know that most of this doesn't match the "taxon concept relationship" discussion about how much does something change before it's considered new, but I think most of that is intractable (hence this thread). But I think arriving at a release where the minimum possible number of identifiers change is going to be welcomed by those who link to CoL. The tree diff approach would also enable you to explicitly generate a list of changes (i.e., release notes). In a way by framing it as an information management question (what is the minimum number of operations to convert one tree into another) you can side-step the biological arguments, thus pissing off everyone equally ;)
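A deliberately simplified sketch of the idea in Python follows. It assumes the nodes of the two releases have already been matched by some stable key, so it only classifies inserts, deletes, and moves by comparing parent pointers; it is not a general minimum tree edit distance algorithm, and the keys (including the `family` placeholder) are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Tree = Dict[str, Optional[str]]  # node key -> parent key (None for the root)
Move = Tuple[str, Optional[str], Optional[str]]  # (node, old parent, new parent)


def diff_trees(old: Tree, new: Tree) -> Dict[str, list]:
    """Classify differences between two releases whose nodes share stable keys."""
    inserts: List[str] = sorted(k for k in new if k not in old)
    deletes: List[str] = sorted(k for k in old if k not in new)
    moves: List[Move] = sorted(
        (k, old[k], new[k]) for k in old if k in new and old[k] != new[k]
    )
    return {"insert": inserts, "delete": deletes, "move": moves}


# Toy version of the Opisthotropis -> Trimerodytes example above.
col_now: Tree = {
    "family": None,
    "Opisthotropis": "family",
    "balteata": "Opisthotropis",
}
col_future: Tree = {
    "family": None,
    "Opisthotropis": "family",
    "Trimerodytes": "family",
    "balteata": "Trimerodytes",
}
edits = diff_trees(col_now, col_future)
print(edits)
```

Under this scheme only the entries under `insert` would receive new ids; the `move` entries keep theirs, and the whole result doubles as raw material for release notes.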

Hope that is more on topic.


mdoering commented on May 29, 2024

Thanks @rdmpage, that is indeed what I am looking for. A tree comparison is rather difficult on that scale, but let's try that out.

The requirements for a solution are:

it needs to be computable based on the data we have. This guarantees a consistent approach, allows users to understand when and why ids change and also have data at hand that explains the change. The id does not change for some opaque reason that is not encoded and visible directly in the data
it should be more stable than a name based id. The use case is to provide an identifier that moves along when the accepted name changes.

Solutions that come to my mind:

name based ids - the baseline. This is what we will start out with this year
protonym based - stick the id to the protonym and use it for its currently accepted name. This seems to be the same as @rdmpage describes in the tree diff. It requires knowing the basionym, see below
protonym set based on analysing the entire synonymy - requires knowing the basionym
name with direct parent taxon or even the entire classification. This leads to less stable ids than the name alone. But maybe it is important for users to have a different id if the classification has changed?

As CoL traditionally has not asked for the basionym of a name, it will take a while until we get that information for the majority of names. It is unlikely we will ever know it for all names. But we can augment the GSD information with nomenclators or even other datasets? It is also often rather obvious from the authorship and can be (provisionally) inferred in a large number of cases.
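A minimal sketch of that provisional inference, in Python, might look like the following. It assumes zoological-style authorship strings in which parentheses mark a recombination, matches on an identical epithet plus normalised author and year, and ignores gender-ending changes and botanical "(Author) Author" forms entirely. The names and helper functions are invented for illustration.

```python
from typing import Dict, Iterable, Optional, Tuple


def authorship_key(scientific_name: str, authorship: str) -> Tuple[str, str, bool]:
    """Return (epithet, normalised authorship, is_recombination)."""
    epithet = scientific_name.split()[-1]
    stripped = authorship.strip()
    recombined = stripped.startswith("(")
    normalised = " ".join(stripped.strip("()").replace(",", " ").lower().split())
    return epithet, normalised, recombined


def infer_basionyms(names: Iterable[Tuple[str, str]]) -> Dict[str, Optional[str]]:
    """Map each name to a provisional basionym, or None if no candidate is found."""
    names = list(names)
    originals: Dict[Tuple[str, str], str] = {}
    for sci, auth in names:
        epithet, norm, recombined = authorship_key(sci, auth)
        if not recombined:
            originals.setdefault((epithet, norm), sci)
    result: Dict[str, Optional[str]] = {}
    for sci, auth in names:
        epithet, norm, recombined = authorship_key(sci, auth)
        result[sci] = originals.get((epithet, norm)) if recombined else sci
    return result


names = [
    ("Aus bus", "Smith, 1900"),
    ("Xus bus", "(Smith 1900)"),   # recombination of Aus bus
    ("Xus cus", "(Jones, 1950)"),  # original combination not in the data
]
basionyms = infer_basionyms(names)
print(basionyms)
```

Anything resolved this way would need to be flagged as provisional until a nomenclator or the GSD confirms it.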


rdmpage commented on May 29, 2024

@mdoering Makes sense to me. Getting basionyms will be a hurdle in some cases, but they are often guessable from the names (as you've been doing for the GBIF taxonomy), and some databases (IPNI and IndexFungorum) explicitly link to basionyms.

If I understand the tree diff approach correctly, then really the only new ids would come from adding nodes (taxa). Moving nodes doesn't change ids, only their relationships change. This makes life simple, but is unlikely to please those who regard taxa as defined, for example, by extension (set of descendants). Perhaps a solution is to store the edits made, so that you can retrieve each node affected by an edit (e.g., a species moving from one genus to another is a deletion from one genus and an addition to another). People could then subscribe to that series of edits and update their own definitions of taxa accordingly.

But back to the topic, regarding scalability, I've not investigated the performance of the code I wrote with Gabriel Valiente forest, but I presume it would be straightforward to partition the CoL classification by major taxonomic group in a divide and conquer approach. Of course, there may also be other/better algorithms and/or tools available.


dremsen commented on May 29, 2024

I am often wrong and never entirely right but will use a made-up story to illustrate the key points in my understanding of what should and should not count as properties of a taxon concept when minting and changing identifiers for them for the COL. My story involves three of us, variously fictionalized. It assumes Rich maintains the COL fish GSD and Markus and I are fish biologists of dubious reputation.

I caught a fish. It's a specimen.

Rich and Markus and I all assess my specimen.

Rich looks at it and says "I don't know how you got this snorkeling in Woods Hole but this is a specimen of Chromis abyssus." When Rich does this, and to paraphrase a past Rich Pyle, he is saying "The specimen you caught is conspecific to the type specimen that I collected and is thus congruent with the concept I had when I caught my specimen and formally described it as a new taxon according to the rules of the code." So my specimen, according to Rich, is an instance of that concept. The concept itself didn't change.

Markus looks at the specimen and sees it a bit differently. He insists Rich has misidentified my specimen and that it is actually a different species, Chromis margaritifer. I don't know why Markus thinks this. But Rich's concept of abyssus still has not changed.

Remsen says "yeah, but look at the tail! Rich said nothing about the tail having a spot" and insists that it's a new species. Rich says "Pfft, not sure that's a spot. I have seen them before." and his next revision of his GSD makes no mention of me and my delusions. My concept doesn't count. He does make a notation of Markus' observation when he updates his GSD.

Chromis abyssus, Pyle 2001 (accepted name)
Chromis margaritifer Fowler, 1946, sec Döring 2020, (chresonym) misidentification.

In doing so, he is saying that the fish Markus identified as margaritifer is really just another abyssus. It's not a synonym because the specimen was not a type. So it's a misidentification, according to Rich, and the citation is a so-called chresonym.

But I'm not done. I do some research, some DNA barcoding, and make a bunch of fancy drawings. I write it all up. I put my specimen (holotype) in a jar and publish my paper in the journal, Calodema, carefully following the rules of the Code. According to those rules, Remsen's concept has now entered the realm of taxonomy and the taxon "Chromis hawkeswoodii" becomes a real (short-lived) species.

During his next revision, Pyle's annotated checklist, published through Aphia, begrudgingly contains some new entries.

Chromis abyssus, Pyle 2001
Chromis hawkeswoodii, Remsen 2020 (heterotypic synonym)
Chromis margaritifer Fowler, 1946, sec Döring 2020, (chresonym) misidentification. #this is removed by the COL because it is not a 'real' synonym and should not be used to improve recall in search.

Pyle's concept of abyssus hasn't really changed. Pyle would assert that the holotype of hawkeswoodii is conspecific to abyssus. As we saw, he did that originally when my fish was just a specimen, prior to me doing all this work and turning it into a holotype. But since I went to the effort to enter my concept and associated specimen into the pantheon of taxonomy by following the nomenclatural rules, his concept has changed. It includes the original 'protonym', a nomenclatural term which is really a proxy for Rich's novel concept. It also now contains Remsen's concept of hawkeswoodii and its associated protonym and holotype. Protonym count went from one to two. Concept changed.

This is essentially how I interpreted the litany of taxonomic publications I reviewed when trying to develop an inclusive taxonomic model with computable concepts. I'm not saying it's right. But I will say it was useful for:

Improving recall in search by providing a list of synonyms that will reduce false negative returns in search.
Improve precision by
2a. establishing an index of 'concepts' with distinct 'protonym circumscriptions' (remember these are proxies for concepts that are always imperfectly described in publications)
2b. establishing an identifier that could be applied to a data object that would distinctly identify one concept labeled with the same name as another, also distinctly identified concept.


mdoering commented on May 29, 2024

@rdmpage we will always keep the history, so you can use the taxon id and go back in time to see what it looked like in the CoL in a specific edition. So you can get the entire history for a concept as it appeared in the CoL. That allows people to link to just the id, which takes them to the most recent version of it. Or they link to a specific edition of the CoL for which results will be immutable. I think that should give users enough freedom to select the kind of id they need for their purpose.


deepreef commented on May 29, 2024

@rdmpage :

By "hierarchical" identifiers I had in mind the notion of URLs as API, that is, how would someone query the data, and couldn't those queries be expressed as URLs that also serve as identifiers? This leads to a clean interface that gives people the answers they are looking for, and a way to automatically cite the identifier for that information.

Yes, I could definitely get on board with that. I guess whenever I see the word "identifier", I immediately jump to a notion that places most emphasis on "globally unique". Among the things I like about databases are precision and a lack of ambiguity. Part of my infatuation with UUIDs is that when I throw something like 8bdc0735-fea4-4298-83fa-d04f67c3fbec into a resolver engine (Google, ZooBank), there is no ambiguity on a global scale exactly what I'm interested in. Another part is opacity, along the lines of the point made earlier by @mjy.

However, more in line with your point, I agree with you that URLs as API also function as identifiers of sort. For example, when I emulated your proposed identifier system in ZooBank:

http://zoobank.org/Search?search_term=abyssus+Pyle+Earle+Greene+2008

Sure enough, I got only one result. In fact, the same is true when I limited it to only the first author:
http://zoobank.org/Search?search_term=abyssus+Pyle+2008
[Incidentally, I checked for uniqueness in GNUB using only the first author name, instead of all author names, and I ended up with a nearly identical result of 96.6% uniques; so first author is just as good for this purpose as all authors.]

With a little bit of alteration to the website code, I could make ZooBank follow the "I'm Feeling Lucky" principle and go directly to the record if there is only one result. I could also tweak the code to eliminate the explicit (and unnecessary) "Search?search_term=" bit, so the URL could just be zoobank.org/abyssus+Pyle+2008. [NOTE: I stripped the http prefix on non-functional URLs, so GitHub wouldn't create hyperlinks out of them.]

In that sense, the identifier "zoobank.org/abyssus+Pyle+2008" would indeed be functionally equivalent to http://zoobank.org/8bdc0735-fea4-4298-83fa-d04f67c3fbec. I don't think I would go so far as to index "[abyssus+Pyle+2008] sameAs [8bdc0735-fea4-4298-83fa-d04f67c3fbec]" in bioguid.org; but that doesn't mean your point about URL-APIs as human-friendly identifiers that work 96.6% of the time isn't useful. And sure, I could relax my own idea of the word "identifier" to even think of this as an identifier.

As for "hierarchical", I'm not entirely sure I understand what you mean in that sense, but perhaps what you mean is that instead of "abyssus+Pyle+2008", you could start with just "abyssus" (as in, "zoobank.org/abyssus"). In ZooBank, you'd get four results:

Chromis abyssus Pyle, Earle & Greene, 2008
Derolathrus abyssus Yamamoto & Parker in Yamamoto, Takahashi & Parker, 2017
Parabaeus abyssus Austin, 1990
Rhinecanthus abyssus Matsuura & Shiobara, 1989

So then you'd need to go to the next level, with something like: zoobank.org/abyssus/Pyle
That would get you down to one result, and a likely winner.

So, having no idea what you meant by "hierarchical", I'm imagining my own version of a "hierarchical" API/Identifier system that starts with the first tier of only the epithet, which by itself would (remarkably) get you only one result about 75% of the time. In the 25% of cases where it's ambiguous from the epithet only, going to the next tier and adding the first author name only will get you a single result about 93% of the time. And, as already mentioned, adding the year expands that to 96.6% singletons. Just out of curiosity, using the year as the second tier (instead of author) yields almost exactly the same result as only the author (93% singletons).
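The tiered lookup imagined above can be sketched in a few lines. This is only an illustration, not how ZooBank works: the `NameRecord` type, the in-memory `RECORDS` list, and the `resolve` function are all hypothetical, and the data are just the four "abyssus" results listed earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NameRecord:
    epithet: str
    first_author: str
    year: int
    full_name: str


# The four "abyssus" results quoted above; a real service would query a database.
RECORDS = [
    NameRecord("abyssus", "Pyle", 2008,
               "Chromis abyssus Pyle, Earle & Greene, 2008"),
    NameRecord("abyssus", "Yamamoto", 2017,
               "Derolathrus abyssus Yamamoto & Parker in Yamamoto, Takahashi & Parker, 2017"),
    NameRecord("abyssus", "Austin", 1990,
               "Parabaeus abyssus Austin, 1990"),
    NameRecord("abyssus", "Matsuura", 1989,
               "Rhinecanthus abyssus Matsuura & Shiobara, 1989"),
]


def resolve(path: str) -> list[NameRecord]:
    """Resolve a tiered path 'epithet[/author[/year]]' to the matching records."""
    parts = [p for p in path.strip("/").split("/") if p]
    if not parts:
        return []
    # Tier 1: epithet only.
    hits = [r for r in RECORDS if r.epithet == parts[0].lower()]
    # Tier 2: narrow by first author, if given.
    if len(parts) > 1:
        hits = [r for r in hits if r.first_author.lower() == parts[1].lower()]
    # Tier 3: narrow by year, if given.
    if len(parts) > 2:
        hits = [r for r in hits if str(r.year) == parts[2]]
    return hits
```

With this sketch, `resolve("abyssus")` returns all four records, while `resolve("abyssus/Pyle")` returns only the Chromis record; a caller would keep adding tiers until exactly one record remains.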

OK, I'm rambling now, and so far have only responded to the first point of the first response to my post, and I see there's a lot more yet to read. And it's not even within the scope of this particular thread, as noted by @mdoering.

So I'll stop now, as I need to get ready to go out for a dive with my son; but when I come back I'll read through all of the new posts, and will strive to come up with a MUCH more concise and coherent reply.

from general.

deepreef commented on May 29, 2024

OK, I lied. One more reply before I go diving.

@mdoering :

Surely there are many ids and even more usages out there. But that is not what the CoL is about.
Our main problem is comparing previous editions of the CoL with the latest version about to be released and assessing which id to release each taxon under.

This is why I've been pushing so hard for CoL to move to a TNU model, rather than some sort of fuzzy "name" model. Like all Meta-Authorities (including all the GSDs that provide content to CoL), it should not be in the business of making statements along the lines of "Aus bus is a valid species" and "Aus xus is a synonym of Aus bus". Instead, it should be making statements along the lines of "We follow Jones 2019 for Aus bus". Because Jones 2019 treated Aus xus as a junior synonym of Aus bus, the synonymy is automatically inherited from the statement.

On a more technical level, here's how it should work:
CoL (via GSDs) should anchor all names of valid species to Protonyms. You already have the content to do this, even if you don't have the full literature citation details of the original description. GNUB can provide the UUIDs to every Protonym in CoL -- I can accomplish that in a weekend or two. As long as the GSDs have their own unique identifier, they don't need to incorporate the Protonym UUIDs because CoL (or better yet, BioGUID.org) can maintain the cross-link index. If GSDs don't have persistent unique identifiers... well, then perhaps it's time to retire those GSDs from CoL (or focus on upgrading those GSDs).

So, CoL then becomes an index of all the world's Protonyms that represent valid species. This Index then needs to have only one other piece of information attached to each Protonym record: The TNU for the treatment that "gets it right" for this taxon.

Yes, I know that GSDs don't provide this information, and it's impractical to get them to do so anytime soon. But my point is that the ProtonymID + AcceptedTNUID model should be the defined endpoint for where CoL should be heading. It will never get there if you don't start exploring the actual mechanism to do so. I agree: it's not at all feasible to apply this to all names across all taxa (and all GSDs). But there is a non-trivial amount of content that it could be applied to. All fishes, for example. At the very least, you could explore this as a "Proof of Concept" approach embedded within a more generalized approach, for the subset of records where ProtonymID+AcceptedTNUID are available; while still maintaining the less effective method for recognizing changes based on the combination of [Immediate Parent]+[Heterotypic synonymy]+[Rank]+[Orthography] approach.

Ultimately, CoL should not be in the business of minting its own identifiers. Instead, it should be a broker of TNU identifiers, putting a "gold star" on selected TNUs that serve as surrogates/proxies for the "box", in which all other TNUs sharing identical [Immediate Parent]+[Heterotypic synonymy]+[Rank]+[Orthography] patterns are placed.
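As a rough sketch of the "box" idea above: TNUs that share the same [Immediate Parent]+[Heterotypic synonymy]+[Rank]+[Orthography] pattern can be grouped under one key. The `TNU` tuple, its field names, the identifiers, and the sample treatments are all hypothetical; which TNU in each box gets the "gold star" is left to the editors.

```python
from collections import defaultdict
from typing import NamedTuple


class TNU(NamedTuple):
    tnu_id: str                      # hypothetical identifier for one usage
    orthography: str                 # the name as spelled in the treatment
    rank: str
    parent: str                      # immediate parent in that treatment
    heterotypic_synonyms: frozenset  # names treated as heterotypic synonyms


def box_key(t: TNU) -> tuple:
    """The four properties that decide which 'box' a TNU falls into."""
    return (t.parent, t.heterotypic_synonyms, t.rank, t.orthography)


def group_into_boxes(tnus):
    """Map each box key to the list of TNU ids sharing that pattern."""
    boxes = defaultdict(list)
    for t in tnus:
        boxes[box_key(t)].append(t.tnu_id)
    return dict(boxes)


# Illustrative data only: two treatments agree, one has no synonyms.
jones = TNU("tnu-jones-2019", "Aus bus", "species", "Aus", frozenset({"Aus xus"}))
remsen = TNU("tnu-remsen-2020", "Aus bus", "species", "Aus", frozenset({"Aus xus"}))
smith = TNU("tnu-smith-2018", "Aus bus", "species", "Aus", frozenset())

boxes = group_into_boxes([jones, remsen, smith])
```

Here the Jones and Remsen usages land in the same box, and the Smith usage gets a box of its own because its heterotypic synonymy differs.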

I know that's a long way into the future, but if that is defined as the end point now, it will make the road to get there all the smoother.

from general.

mjy commented on May 29, 2024

Offtopic!

@deepreef TNUs != OTUs. The former are handled in TW by NOMEN+Citations IIRC. OTUs are what WikiData are doing, just an anonymous QID + data, some names, some not, I think. Requiring nomenclature to define biological concepts doesn't universally work (bacteria, genetic species concepts), so why not abandon this approach from the get go? (Don't answer here.) In TW we embrace OTUs. Users define a list of OTUs to export to their GSD. We crawl the list of OTUs to find out what nomenclature should come with the list.

On topic, but not very constructive

Good luck with the tree diff approach. Note that AFAIK CoL doesn't really manage a classification as I think @rdmpage envisions they do. Until very recently they didn't even return some of the commonly used ranks. The classification that does exist is human constructed based on the Editor appending sectors onto a tree.

I assume that a more complete classification for the purposes here will be built by algorithm. I assume it will have all the same issues GBIF's does. So take that into account when you assume stability of identifiers embedding information derived from algorithms. For example, one species of tenebrionid appearing in 4 kingdoms by the time it gets to GBIF collapses the consensus, to use another tree-based concept.

Oh, you'll also need to embed versioning into the whole system, as the algorithm will clearly evolve as you struggle to find any use for it. Each commit to the algorithm will render past identifiers for concepts meaningless, as it will no longer have the same rules, and trying to figure out what changed between versions with respect to species concepts will only be useful as a sadistic test for graduate students taking computer science prelims. ;)

from general.

rdmpage commented on May 29, 2024

@mjy I'm not quite so pessimistic, but don't have data to argue the point. The tree diffs needn't operate on CoL itself, they could be applied to the input classifications from the source databases (e.g., the reptile database mentioned above).

from general.

deepreef commented on May 29, 2024

Back from diving, and lots to think about/discuss. But quick for right now to @mjy "Off Topic", which I actually think is very much "On-topic", because IIUC (not sure if that's a thing, = "If I Understand Correctly"), @mdoering is trying to answer the broad question "When do I mint a new CoL Identifier, vs. when do I modify properties associated with an existing identifier?" (CMIIW, @mdoering ). The simple answer to that question is, "When the concept/circumscription is different!" But that's not a very useful answer, because we haven't yet answered the prerequisite questions, "What is a concept?", "Is it the same as a circumscription, or different?", and more to the point, "What are the core properties of a concept/circumscription such that a change in one of these properties results in an implied different concept/circumscription?"

So, in that context, the clarification that "TNU != OTU" is both very helpful and very relevant to these prerequisite questions.

To start, a bit of clarification of my own. Although the "N" part of TNU is often assumed to be a Linnean-style scientific name (and that's where most of our focus has been), that's not necessarily the only context in which the "N" part applies. There's been some discussion of this over at tdwg/tnc, but I would certainly include some classes of non-Linnean names (and some advocate for opening it to all text-string labels, including vernaculars/etc.) But the point is, Linnean-style nomenclature is absolutely not required for TNUs to work either. But I'm pretty sure that the "T" part of TNU is the same as an OTU (if not, then CMIIW).

So here are some questions about OTUs in this context (i.e., the WikiData notion of it, as adopted by TW):

Do they always have some sort of text-string label associated with them? I'm assuming the QID at least, but is that the only way to cite them?
What properties of the "Data" part help you determine whether you're dealing with a new instance of an existing QID-branded OTU, vs. an OTU that requires the minting of a new QID?

I think these questions are on-topic for the issue raised by @mdoering, because if a CoL "thing" is the same as a WikiData/TW OTU "thing", then understanding the logic behind when a new QID is minted for an OTU vs. when an existing one is amended might directly address the same question in the CoL context.

PS, Before I wrote the above, I didn't know that "CMIIW" was a thing, but evidently it is. I also just now learned that IIUC is a thing too.

from general.

dremsen commented on May 29, 2024

Your simple answer is useful, Rich, because it's a good start. You mint a new identifier if, and only if, the concept changes. Anything else and your identifier must be referring to something else. A concept changes when something is added to it or removed from it. What is that something? It's clear that one answer, at least, is that a concept changes when you add other taxa to it or split taxa (new or previously included) from it.

from general.

mjy commented on May 29, 2024

Warning, off topic sensu my take on requirements for #6, includes themes repetitive with previous spewing by me

@deepreef we all have ideas about how identifiers should be minted for OTUs, @dremsen's ideas are perfectly fine. We know that we need new IDs for new concepts.

Any aggregating system in place to track all concepts on Earth must be set up to handle the simplest of all cases: the curator of the GSD provides a unique identifier that they assert will only change if their curated concept changes. If you don't have a data model for this basic use case in place, then you're not handling the best case scenario.

To me, the OTUid (QID for example here, but really could be a big UUID, whatever - just no meaning plz) coming from the curator of a GSD (these species concepts don't just come from nowhere, they come from blessed lists of various quality as curated by a human) is the single best way to track differences. If the curator changes the id, they understand that they are asserting a new taxon concept. The way we teach them to think about this is that if you had concept A, and you did science 1, and then concept B, and science 1, you hypothesize that you might get a different answer. We force curators to think of list of OTUs, not list of names because the CoL is a list of OTUs, and the names we can use to get near to them.

I wish I had my philosophy of science notes from undergrad back in front of me. The course so elegantly pointed out all the problems with trying to uniquely identify things. Definitions based on sets, expanding and contracting definitions, all chairs and not chairs, etc. etc. All of them failed in some cases. This is extremely well understood philosophically. The exercise here would fit right into one of those bodies of thought. What to do then? At the end of the day, what you need are meaningful units. What is a meaningful unit in our case? The thing you can do science with. What thing? A species concept, something "real". That unit gets a single, anonymous ID, Q, or other meaningless URI, etc.

To your questions:

Do they always have some sort of text-string label associated with them? I'm assuming the QID at least, but is that the only way to cite them?
- The QID is how to uniquely identify them. How we localize to that concept (localizing to information being a very useful concept that should be embraced IMO) can happen in many different ways, names, "things that are red", hyperlinks, printing the QID, it doesn't matter as long as one has a reasonable path that works in most cases (not even all, I understand biology is vast and tricky).
What properties of the "Data" part help you determine whether you're dealing with a new instance of an existing QID-branded OTU, vs. an OTU that requires the minting of a new QID?
- A) In part this doesn't matter presently, because we're not asking our GSD providers to actually provide data, just localizers to some concept that isn't uniquely identified. So, given the data in CoL alone I can't do anything to determine if the concept has changed (thus my rants that @mdoering is being asked to do the impossible).
- B) If they were providing actual data defining the concept then as a scientist I would look to see if that data is suitable for the hypothesis I am testing. For example I am not selecting morphological species concepts to hypothesize about rates of molecular evolution. To assume we can do good science in the absence of understanding (for example by proxy of algorithm-defined identifiers) is foolhardy IMO. This is particularly critical at scale, i.e. across all species on Earth. If the CoL starts minting ids under one namespace for all its aggregated data from its data-sources of various quality, then what's going to happen? Those who are doing "robust" science are going to assume CoL has done due diligence and things with similarly namespaced IDs identify things with similar meaning. They don't, of course we'd never claim this. So, the basic thing I'd like to see is to have the curators, the people that make GSDs possible, put their money where their mouth is and say: "This is a species concept you can do work with, and you can do similar work with the rest of things on my list (your list may differ). You can also assume that I'll be damn sure to provide a different QID if I enumerate the list of species and come up with something different. I sure hope your global list has a basic spot to track my unique assertions that isn't some name, we know that nomenclature is nuts." This is after all the job of GSD providers (currently the sole source of data to CoL, if that changes then elements of this argument change). How do they do this? They provide a QID (or UUID, etc., some globally unique ID) to uniquely identify their concept.
QID stays the same, and names change, @mdoering knows that's just nomenclatural mumbo jumbo. QIDs differ, and names stay the same: similar mumbo jumbo. Different concepts identified by different systems of IDs? Well, get your science on, localize to those concepts and figure it out, that's about the best you can do. Any taxonomist does this naturally without thinking, it seems very very strange to me to pretend this isn't necessary. It also seems very very strange to me to hide this hard work with algorithm based identifiers so ecologists can do bad science. We can do better, we can trust GSDs and give them simpler ways to uniquely identify their concepts.

TW is trying to do this, and there have been many cases where because we've linked data to the right concepts, things (data curation actions) have become trivially simplified. For example specimens are determined as OTUs. When nomenclature facts are added the specimens don't MOVE. They don't change OTUs. The facts are added to the nomenclature, and the OTU is pointed to its current valid nomenclatural name. You can stack as many OTU determinations on specimens as you want, each can reference the nomenclatural facts as they were presented. At the end of the day a curator has specimens under some current OTU, and they have the history of determination separate from the history of nomenclature, or inter-twined in a timeline display if needed. Exactly how nomenclature is supposed to work IMO. Now, if you linked specimens to names instead of the proxy OTU concept (which too many systems do presently), and you had to split or merge the names, you'd have to somehow decide which name keeps the links to the specimens, and which doesn't. Ugh! This is just one example of the nice division of labour we get when we have an OTU table, and manage nomenclature as nomenclature.
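To make the division of labour concrete, here is a minimal sketch of specimens pointing at opaque OTU ids while names hang off the OTU. It is not TaxonWorks' actual data model; the `OTU` and `Specimen` classes, the field names, and the ids are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class OTU:
    otu_id: str        # opaque identifier carrying no meaning
    valid_name: str    # pointer to the current valid name
    name_history: list = field(default_factory=list)


@dataclass
class Specimen:
    catalog_number: str
    determinations: list = field(default_factory=list)  # OTU ids, oldest first

    @property
    def current_otu(self):
        """The most recent determination, or None if undetermined."""
        return self.determinations[-1] if self.determinations else None


def add_nomenclatural_fact(otu: OTU, new_valid_name: str) -> None:
    """Re-point the OTU at a new valid name; specimens are not touched."""
    otu.name_history.append(otu.valid_name)
    otu.valid_name = new_valid_name


otu = OTU("Q-0001", "Aus bus")
specimen = Specimen("SPEC-42", ["Q-0001"])

# A purely nomenclatural change: the name moves, the specimen link does not.
add_nomenclatural_fact(otu, "Xus bus")
```

After the name change, `specimen.current_otu` is still `"Q-0001"`, and the old name survives in `otu.name_history`, so determination history and nomenclatural history stay separate.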

TLDR - I don't believe we can do better without a different data model at the core ("anonymous" nomenclature free concepts), and better tools and processes for GSD curators.

from general.

deepreef commented on May 29, 2024

@dremsen 👍
I think that's exactly right! But when you say "add other taxa", at the species level what that means is that you are adding another heterotypic synonym, which means you're adding a new type specimen to the concept. However, it's not that simple. First, there are all the OTUs that don't have Linnean-style names. I fully agree with @mjy that requiring [Linnean-style] nomenclature to define biological concepts doesn't universally work. So the "type specimens as boundary markers for concept circumscriptions" can only go so far (i.e., can only really work in the context of taxa signified with Linnean-style names anchored to name-bearing types).

Second, consider this scenario:

Smith 1950 describes the new species Aus bus from specimens in Hawaii (TNU: Aus bus Smith 1950 sensu Smith 1950)
Jones 2010 describes the new species Aus xus from specimens in Palau, and declares its closest relative to be Aus bus Smith 1950 (TNU: Aus xus Jones 2010 sensu Jones 2010; TNU: Aus bus Smith 1950 sensu Jones 2010)
Pyle 2015 treats Aus xus as a synonym of Aus bus (TNU: Aus xus Jones 2010 sensu Pyle 2015, as a heterotypic synonym of TNU: Aus bus Smith 1950 sensu Pyle 2015)

We've got five TNUs here, four of which represent taxa asserted to be valid. The fifth TNU is Pyle's assertion that the type specimen of Aus xus is conspecific with the type specimen of Aus bus, and because Aus bus has priority, his (Pyle's) concept is labelled as "Aus bus", but it includes both Jones' concept of Aus bus and Jones' concept of Aus xus (not always the case, but for sake of simplicity, let's say it's true in this case).

So, suppose the 2009 CoL has ID1234 associated with Aus bus, which we'll infer to be Aus bus Smith 1950 sensu Smith 1950.

Now Jones comes along in 2010 and names Aus xus, so CoL mints a new ID9876 for Aus xus Jones 2010 sensu Jones 2010 to include in its 2011 Catalogue.

Here's the kicker: Does CoL issue a new ID for Aus bus? If so, why? How would CoL ever know whether this is a case of Aus bus being "split" into two species by Jones, or it's just a new discovery of a new sister-species (Aus xus) to the already established Aus bus?

The problem is that Smith 1950 didn't examine any specimens from Palau, so we have no idea whether he would have included specimens from Palau within his circumscription of A. bus, or if he would have agreed with Jones that the Palauan species is different. So at this stage, CoL can't decide, based on the information it has, whether its representation of Aus bus needs a new ID, or can keep using the same ID.

However, suppose that CoL has a TNU-based model, and for its 2009 catalogue it anchored the record for Aus bus to the treatment of Remsen 2005 (TNU: Aus bus Smith 1950 sensu Remsen 2005). With a little bit of @nfranz - style sleuthing, we discover that Remsen examined specimens from Palau and declared them to be Aus bus. Now we have a good idea that CoL had defined its record for Aus bus s.l., so by recognizing a portion of this circumscription in the form of Aus xus Jones 2010, we know that a new s.s. circumscription is needed for the CoL record of Aus bus, and thus a new ID is created for Aus bus s.s. to distinguish it from the earlier CoL record with ID1234.

Of course, it's rarely the case that there are only two alternatives, so "s.l." vs. "s.s." is kind of useless. A MUCH better approach is for CoL to explicitly use "sensu Remsen 2005" and "sensu Jones 2010" (respectively), instead of "sensu lato" and "sensu stricto".

The problem, though, is that it takes a bit of @nfranz - style sleuthing to make this determination, and CoL can't incorporate that information into its records. However, it can make something of Aus bus Smith 1950 sensu Pyle 2015, because this TNU also reveals the second type-specimen-by-proxy of the protonym link embedded within Aus xus Jones 2010 sensu Pyle 2015 pointing to Aus xus Jones 2010 sensu Jones 2010.

If anyone actually followed that, I'm deeply impressed (I had to re-read it several times myself, and I still probably screwed something up). But here's the short summary point: With a TNU model, you can do a pretty powerful job reasoning/computing backwards through time (e.g., comparing Aus bus Smith 1950 sensu Pyle 2015 to Aus bus Smith 1950 sensu Jones 2010), but it's much harder to reason forward in time (e.g., comparing Aus bus Smith 1950 sensu Smith 1950 to Aus bus Smith 1950 sensu Jones 2010).
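For anyone who did not follow, the five TNUs in the scenario can be written down as data, and the backward comparison then becomes a set comparison over the protonyms each treatment explicitly places under Aus bus. This is a simplified sketch: the `TNU` class, its fields, and the `compare` wording are invented here, and protonyms stand in for type specimens by proxy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TNU:
    protonym: str      # the name being used
    sensu: str         # the treatment in which it is used
    accepted_as: str   # protonym of the name that treatment accepts for it


# The five TNUs from the Smith / Jones / Pyle scenario above.
TNUS = [
    TNU("Aus bus Smith 1950", "Smith 1950", "Aus bus Smith 1950"),
    TNU("Aus xus Jones 2010", "Jones 2010", "Aus xus Jones 2010"),
    TNU("Aus bus Smith 1950", "Jones 2010", "Aus bus Smith 1950"),
    TNU("Aus xus Jones 2010", "Pyle 2015", "Aus bus Smith 1950"),
    TNU("Aus bus Smith 1950", "Pyle 2015", "Aus bus Smith 1950"),
]


def circumscription(accepted: str, sensu: str) -> frozenset:
    """Protonyms a treatment explicitly places under the accepted name."""
    return frozenset(
        t.protonym for t in TNUS
        if t.sensu == sensu and t.accepted_as == accepted
    )


def compare(accepted: str, later: str, earlier: str) -> str:
    """Compare the explicit contents of two treatments of the same name."""
    a = circumscription(accepted, later)
    b = circumscription(accepted, earlier)
    if a == b:
        return "same explicit types (congruence not established)"
    if a > b:
        return "later is broader"
    if a < b:
        return "later is narrower"
    return "incomparable"
```

Comparing Pyle 2015 with Jones 2010 gives "later is broader", because Pyle's Aus bus contains both protonyms. Comparing Jones 2010 with Smith 1950 only gives "same explicit types", which is exactly the forward-in-time problem: the data alone cannot say whether Smith would have included the Palauan specimens.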

OK, more to come, but I'm approaching this one point at a time.

from general.

rdmpage commented on May 29, 2024

And it seemed we were getting so close to a resolution of the issue...

My sense from this discussion is that there are (at least) two different approaches to the topic.

1. ids should reflect our knowledge of taxa, and two taxa have the same id only if they are the same. If taxa change, they get a new id. I note that agreement on the meaning of same and change seems, um, elusive (cue numerous "A. us, A. bus" discussions), but I digress. Hence with each iteration you want ids that faithfully reflect current taxonomic understanding, and hence reflect changes in taxa (however defined). One consequence is that downstream users of these ids (e.g. people linking to them in their own databases) will be faced with regular changes to some (most?) ids.
2. ids should be as stable as possible so that they provide a reliable basis for external linking (e.g., by downstream users, Wikidata, etc.). Hence with each iteration, the goal is to minimise changes in ids. Downstream users will be able to link with confidence that the id is likely to be stable, with the proviso that what the id represents may itself have changed in ways that some users would consider meaningful (e.g., a genus has acquired additional species from another genus).

I am not sure we can do both, so I think the real question is which outcome best reflects CoL's goals? I'm guessing it's no surprise that I value identifier stability (2) more than fidelity to particular taxonomy (1), so I would vote for 2. This also means that I regard the "A. us, A. bus" discussions as essentially beside the point. The likelihood of me using CoL identifiers is mostly a function of their stability and how interconnected they are with other identifiers.

Obviously, if faithful representation of what the ids point to matters more (e.g., you can't accept that a genus with the same name but different component species can have the same id), then you will favour 1, and then the crucial issue is defining a set of criteria for determining identity of taxa (@mdoering original question before the rowdy neighbours turned up with alcohol and music).

In a sense these aren't completely different positions (obviously 2 still depends on some notion of same, in this case similarity of edges in the graph i.e., parent → child pairs having the same labels) but it seems to me that 1 is effectively blocked in the absence of agreement on the operational meaning of same. Likewise as @mjy has pointed out, advocates of 2 (e.g., me) have argued for a simple tree diff approach without demonstrating a working system.
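For what it's worth, the core of the edge comparison is small. The following is a toy sketch, not a working system: it assumes each classification is a dict mapping a child label to its parent label, and the names, ids, and `mint` helper are all made up.

```python
import itertools


def edges(classification: dict) -> set:
    """Turn a child -> parent mapping into a set of (parent, child) label pairs."""
    return {(parent, child) for child, parent in classification.items()}


def assign_ids(old_class: dict, old_ids: dict, new_class: dict, mint) -> dict:
    """Keep the old id wherever the (parent, child) edge is unchanged; mint otherwise."""
    unchanged = edges(old_class) & edges(new_class)
    new_ids = {}
    for child, parent in new_class.items():
        if (parent, child) in unchanged and child in old_ids:
            new_ids[child] = old_ids[child]
        else:
            new_ids[child] = mint()
    return new_ids


_counter = itertools.count(1)


def mint() -> str:
    """Hypothetical id generator for this example."""
    return f"new-{next(_counter)}"


# Previous release: species "dus" sits in genus Xus.
old_class = {"Aus": "Aiidae", "Aus bus": "Aus", "Aus cus": "Aus",
             "Xus": "Aiidae", "Xus dus": "Xus"}
old_ids = {"Aus": "ID1", "Aus bus": "ID2", "Aus cus": "ID3",
           "Xus": "ID4", "Xus dus": "ID5"}

# Next release: "dus" has been moved into Aus.
new_class = {"Aus": "Aiidae", "Aus bus": "Aus", "Aus cus": "Aus",
             "Aus dus": "Aus", "Xus": "Aiidae"}

new_ids = assign_ids(old_class, old_ids, new_class, mint)
```

In this toy run the genus Aus keeps ID1 even though it gained a species, which is the stability-over-fidelity trade-off of approach 2, while "Aus dus" is the only label that receives a freshly minted id.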

So, in summary, it seems to me that there are two separate goals here: fidelity to changing concepts vs stability of identifiers in the face of change.

from general.

mdoering commented on May 29, 2024

@mjy

Any aggregating system in place to track all concepts on Earth must be setup to handle the simplest of all cases- the curator of the GSD provides a unique identifier that they assert will only change if their curated concept changes. If you don't have a data model for this basic use case in place, then you're not handling the best case scenario.

The CoL model does handle that obviously. But CoL deals with very heterogeneous data from a wide range of sources (we prefer to avoid the term GSD as the sources are often not "global" and also not limited to "species"). Some do have ids, some do not at all. And what they represent we hardly ever know. It might be database records that change their id by some evil algorithm. It might be name identifiers, it might be "OTU" identifiers. We do not know. But even if we did have ids for OTUs from each and every source, they would never apply the same methods or rules for defining a concept. It's different between larger taxonomic groups, it might be more molecular driven, it could be more or less phylogeny driven, it could be more of a splitter or lumper philosophy. It surely is never consistent. You could argue we do refer to the original source and can just forward the responsibility of the idea of a concept to them. But for an end user the CoL becomes even more heterogeneous and they would have a hard time understanding what that id means and if they can trust them for their purpose. My main reasons for having a genuine, consistent CoL identifier based on some agreed method are:

consistent identifiers which share the same meaning and behavior on inserts/updates/deletions that can be predicted
missing source ids: we will have to cater for missing source identifiers in any case
edited concepts: even though we borrow from the sources the CoL still is an edited product and classifications and even to some degree the exact name and status can deviate from the source. For the extended catalogue that we are building we will even aggregate information from several sources for the same taxon, e.g. add a missing reference, more synonyms or vernacular names.
missing evidence: the CoL data model, and most often also the sources, lack information on the exact concept description. If a change happens the data seen by the end user might be exactly the same! What good is that to a user if he cannot tell apart the concepts from the data he gets? A changed id for a byte-wise identical record hardly makes any sense.

It would be an option to use a hybrid approach and treat sources differently. We could mark manually selected sources as having properly curated taxon identifiers and blindly follow their changes while others fall back to the default CoL provided ids.
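A minimal sketch of that hybrid rule, assuming a manually maintained allow-list of sources. The source names, the `CURATED_SOURCES` set, and the `choose_taxon_id` function are placeholders, not anything that exists in the CoL code base.

```python
from typing import Optional

# Hypothetical allow-list of sources whose taxon ids are trusted as curated.
CURATED_SOURCES = {"source-with-curated-ids"}


def choose_taxon_id(source: str, source_id: Optional[str], col_id: str) -> str:
    """Follow the source id for curated sources; otherwise use the CoL default id."""
    if source in CURATED_SOURCES and source_id:
        return f"{source}:{source_id}"
    return col_id
```

A record from a curated source keeps following the source's own id changes, while a record with a missing id, or from a source not on the list, falls back to the id CoL provides.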

from general.

mdoering commented on May 29, 2024

Reptile DB btw does not even expose their identifiers but instead recommends to link to their species pages by their search by name URL: http://reptile-database.reptarium.cz/species?genus=Trimerodytes&species=yunnanensis

from general.

deepreef commented on May 29, 2024

All: This is probably the most useful discussion I've had in months (if not years), because it actually feels like we're getting somewhere on a topic where wheels have been spinning and spinning. So fair warning and apology, much more to come.

Here I want to "see" the hypothetical from @dremsen and "raise" it into an actual, real-world example from my dive today. But first, a nit-pick:

to paraphrase a past Rich Pyle, he is saying "The specimen you caught is conspecific to the type specimen that I collected and is thus congruent with the concept I had when I caught my specimen and formally described it as a new taxon according to the rules of the code."

The first part of that is right, but the "congruent" part is a bit off. I would actually phrase it as:

"In my taxonomic opinion, the specimen you caught is conspecific to the type specimen that I collected, and ~~is thus congruent with~~ thus falls within the species-level concept I had when I ~~caught my specimen and~~ formally described ~~it as~~ a new taxon and established my specimen as the name-bearing type according to the rules of the code. Because no other earlier-established name-bearing type falls within my concept, then the correct name for my concept, and thus your specimen, is Chromis abyssus."

Just because I include your specimen within the same circumscription that I have in my head for C. abyssus doesn't mean that my concept is necessarily congruent with any other concept.

Anyway, getting back to the real-world example. This is a cropped frame grab from a video I took today:

It's in the same genus as the one in the @dremsen hypothetical (Chromis), but this one lives shallow and is probably the most common species of its genus in many places where it lives.

As an Ichthyologist born and raised on Hawaiian reefs, I have no trouble identifying this as Chromis agilis, described by Smith, 1960 (see Protonym in ZooBank). Don't take my word for it, check it out yourself.

Here it is in CoL.

CoL cites FishBase as the source database (GSD), where the online resource is. Going to that link reveals a distribution map showing broad distribution across the Indo-Pacific, and cites Allen 1991 as the "Main Reference". The record in WoRMS is derived from the same source.

Here is the record in ITIS. And here it is in Catalog of Fishes.

This is about as stable as taxonomy gets. At least it was... until last week, when this was published.

You can read the PDF if you want, but the short story is that Allen & Erdmann came to the conclusion that the Pacific populations represent a different species from those in the Indian Ocean. The type specimen of C. agilis is from the Seychelles, and it turns out that the taxonomy has been so stable since 1960, that no synonyms have ever been described from anywhere else (including the Pacific). So Allen & Erdmann decided to describe the new species [Chromis pacifica], based on a type specimen collected in the Coral Sea.

So... we have 33 TNUs in GNUB hooked into the Protonym for C. agilis:

Here's the challenge: How many OTUs are there? Is this the same as the number of CoLID values there should be? What additional information would you need to determine how many OTUs?

In my proposed pathway to salvation, I would have CoL harvest one more piece of information from the GSD source record for C. agilis: the TNU for FishBase's "Main Reference". It's in the list above as Chromis agilis Smith, 1960 sensu Allen 1991. It would take me about one weekend to hook all the existing CoLID values derived from FishBase into the corresponding GNUB Protonyms and FishBase "accepted" TNU values.

In the next cut of FishBase that is imported into CoL, you would note two things:

The addition of a new Protonym for Chromis pacifica Allen & Erdmann, 2020 sensu Allen & Erdmann 2020
A new "Main Reference" from FishBase in their record for C. agilis, pointing to Chromis agilis Smith, 1960 sensu Allen & Erdmann 2020

Thus, CoL would mint a new ID for C. pacifica (because it's a new name not previously imported into CoL), and would mint a new ID for C. agilis (because the "accepted" TNU from the source GSD changed).
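The minting rule just described can be sketched as a comparison of two imports, each reduced to a mapping from Protonym to the source's accepted TNU. The dict layout and the `ids_to_mint` function are assumptions made for illustration; the values are the FishBase "Main Reference" treatments mentioned above.

```python
def ids_to_mint(previous: dict, current: dict) -> dict:
    """Return {protonym: reason} for every record that needs a new id."""
    to_mint = {}
    for protonym, accepted_tnu in current.items():
        if protonym not in previous:
            to_mint[protonym] = "new protonym"
        elif previous[protonym] != accepted_tnu:
            to_mint[protonym] = "accepted TNU changed"
    return to_mint


# Previous cut: C. agilis follows Allen 1991.
previous_import = {
    "Chromis agilis Smith, 1960": "sensu Allen 1991",
}

# Next cut: a new protonym, and a new Main Reference for C. agilis.
current_import = {
    "Chromis agilis Smith, 1960": "sensu Allen & Erdmann 2020",
    "Chromis pacifica Allen & Erdmann, 2020": "sensu Allen & Erdmann 2020",
}

result = ids_to_mint(previous_import, current_import)
```

Both records come back: C. pacifica because the protonym is new, and C. agilis because the accepted TNU changed. A record whose protonym and accepted TNU are both unchanged keeps its existing id.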

In the long run, CoL would stop minting IDs altogether, and simply make statements along the lines of:

"With regard to Protonym Chromis agilis Smith, 1960 sensu Smith 1960, we defer to FishBase, who follows Chromis agilis Smith, 1960 sensu Allen & Erdmann 2020"

You could cache a bunch of other metadata, of course, but the core service provided by CoL would be an endorsement of Chromis agilis Smith, 1960 sensu Allen & Erdmann 2020, determined via FishBase.

OK, more on the rest of @dremsen 's hypothetical post in a moment.


rdmpage commented on May 29, 2024

@mdoering

Reptile DB btw does not even expose their identifiers but instead recommends to link to their species pages by their search by name URL: http://reptile-database.reptarium.cz/species?genus=Trimerodytes&species=yunnanensis

Yes, which strikes me as bad design, made worse by the result if you use the URL that would have applied to this species before: http://reptile-database.reptarium.cz/species?genus=Sinonatrix&species=yunnanensis :

Species Sinonatrix yunnanensis was not found!
You can try find it as synonym, or use advanced search for searching it other way.

Given that Reptile DB knows that Sinonatrix yunnanensis is a synonym of Trimerodytes yunnanensis I don't understand why they can't just take you to Trimerodytes yunnanensis !?
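As a rough illustration of the behaviour I would expect, here is a minimal sketch of a name-based link builder that follows a known synonym to the accepted name before building the URL. The `SYNONYMS` and `ACCEPTED` tables and the function `species_url` are hypothetical stand-ins; only the base URL and its `genus`/`species` parameters come from the links above, and this says nothing about how Reptile DB is actually implemented.

```python
from typing import Optional
from urllib.parse import urlencode

BASE = "http://reptile-database.reptarium.cz/species"

# Hypothetical lookup tables; a real site would read these from its database.
ACCEPTED = {("Trimerodytes", "yunnanensis")}
SYNONYMS = {("Sinonatrix", "yunnanensis"): ("Trimerodytes", "yunnanensis")}


def species_url(genus: str, species: str) -> Optional[str]:
    """Return the species page URL, following a known synonym to the accepted name."""
    name = (genus, species)
    name = SYNONYMS.get(name, name)   # redirect synonym -> accepted name
    if name not in ACCEPTED:
        return None                   # genuinely unknown name
    return f"{BASE}?{urlencode({'genus': name[0], 'species': name[1]})}"


print(species_url("Sinonatrix", "yunnanensis"))
```

With a lookup like this, the old Sinonatrix link would land on the Trimerodytes page instead of a "not found" message.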

Anyway, the integer ids I use are in the database dumps, and on first glance seem stable across releases of the Reptile DB checklist.


deepreef commented on May 29, 2024

@dremsen :

Chromis abyssus, Pyle 2001
Chromis hawkeswoodii, Remsen 2020 (heterotypic synonym)
Chromis margaritifer Fowler, 1946, sec Döring 2020, (chresonym) misidentification. #this is removed by the COL because it is not a 'real' synonym and should not be used to improve recall in search.

The only way that third one has any place in this discussion about circumscriptions/concepts is if you're going for the extrinsic approach of defining concepts/circumscriptions by enumerating lots and lots of individual organisms. We must have different interpretations of the meaning of "chresonym" (a term I've never liked, or used); because I do not see that third one as a chresonym. I don't even see it as playing a role in taxonomy. It's a dispute about the identification of a particular organism, which is a whole different thing from reasoning across taxon concepts.

Pyle's concept of abyssus hasn't really changed. Pyle would assert that the holotype of hawkeswoodii is conspecific with abyssus. As we saw, he did that originally when my fish was just a specimen, prior to me doing all this work and turning it into a holotype. But since I went to the effort to enter my concept and associated specimen into the pantheon of taxonomy by following the nomenclatural rules, his concept has changed. It includes the original 'protonym': a nomenclatural term which is really a proxy for Rich's novel concept. It also now contains Remsen's concept of hawkeswoodii and its associated protonym and holotype. Protonym count went from one to two. Concept changed.

Exactly! This is what I was trying to get at. We don't know the relationship between Chromis abyssus, Pyle 2001 sec Pyle 2001 and Chromis abyssus, Pyle 2001 sec Pyle 2020. But we do know the relationship between Chromis abyssus, Pyle 2001 sec Pyle 2020 and Chromis abyssus, Pyle 2001 sec Remsen 2020 (assuming Remsen regarded C. abyssus as a valid and distinct species). That's because both Protonyms are referenced in both publications, so there is computable logic here. We don't know how Chromis abyssus, Pyle 2001 sec Pyle 2001 relates to the others unless we do some @nfranz -level sleuthing.
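Here is a minimal sketch of that "computable logic", under a deliberately simplified assumption: each usage is reduced to the set of Protonyms it includes and the set it explicitly treats as distinct, and two usages are only comparable when they address the same Protonyms. The `Usage` class, the `relate` function and the relationship labels are hypothetical names for illustration, not an existing GNUB or CoL API.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Usage:
    label: str
    included: FrozenSet[str]              # protonyms placed inside the concept
    excluded: FrozenSet[str] = frozenset()  # protonyms referenced but kept distinct


def relate(a: Usage, b: Usage) -> str:
    """Relate two usages, but only if both address the same set of protonyms."""
    if (a.included | a.excluded) != (b.included | b.excluded):
        return "unknown"          # one usage never referenced a protonym the other did
    if a.included == b.included:
        return "congruent"
    if a.included > b.included:
        return "includes"
    if a.included < b.included:
        return "is included in"
    if a.included & b.included:
        return "overlaps"
    return "excludes"


ABYSSUS, HAWKESWOODII = "abyssus Pyle 2001", "hawkeswoodii Remsen 2020"

pyle_2001 = Usage("Chromis abyssus sec Pyle 2001", frozenset({ABYSSUS}))
pyle_2020 = Usage("Chromis abyssus sec Pyle 2020", frozenset({ABYSSUS, HAWKESWOODII}))
remsen_2020 = Usage("Chromis abyssus sec Remsen 2020",
                    frozenset({ABYSSUS}), frozenset({HAWKESWOODII}))

print(relate(pyle_2020, remsen_2020))  # both reference both protonyms
print(relate(pyle_2001, pyle_2020))    # Pyle 2001 never addressed hawkeswoodii
```

The first comparison is computable because both 2020 usages reference both Protonyms; the second comes back as unknown, which is exactly where the sleuthing would have to start.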

OK, I'll stop replying until I'm caught up reading.


ThierryBourgoin commented on May 29, 2024

Dear all,

What an interesting discussion! But it is difficult to follow when joining it today with 60 emails already on my counter… A few quick thoughts, even if I’m not sure they are relevant to this discussion...

1. Taxonomy knowledge versus taxonomy usage. I think we need to separate taxonomy knowledge from the focal usages/practices of taxonomy (meeting specific needs). In the digital sphere the first needs a complete formalisation of what a taxon is; the latter tolerates some ambiguity because it serves local/focal purposes. If we could achieve the first, we could easily take from it what is needed to operate the second; but by starting with a focus on the second, we will have to reinvent the wheel each time from the local/focal objectives to be served. And thus we get the current landscape, with lots of ways (tools, identifiers, practices …) to address taxonomy according to specific interests. I already mentioned this at CoL’s 2017 Woods Hole meeting and discussed it with Rich, and it is the spirit of my paper (still not published) about taxon formalisation that several of you have already read (or reviewed): the goal should first be how to transfer/translate taxonomy knowledge into the digital sphere, even if trying to meet the needs is of course necessary. All this discussion shows well that a complete formalisation of what a taxon is, and how to represent it digitally, is still a pending issue.

2. Approximation in terms. If we agree that a concept has (at least) 3 major properties (Name (N), Taxon defined by circumscription (Tc) and Taxon defined in intension (Ti)), then the taxon concept we use in the new CoL represents only Tc, not the complete taxon (T). This is a semantic shortcut we need to be aware of when looking for taxonomic identifiers beside CoL. The best the new CoL can address is identifiers for N, Tc and N+Tc, but not for T, which CoL does not address completely! There is no taxon concept in CoL. These 3 properties are clear, I think, for everyone, but there is probably a fourth one: its dynamic component (biological nature, conceptual perception).

3. Identifying what? Taxon identifiers are needed for the practice of taxonomy itself and for external usage of it. However, having them fixes the taxon as a static entity, while it is an evolving concept in both its biological nature and its conceptual perception. I know almost nothing about identifiers, but at least such an identifier should be able to address this paradox. Addressing the issue by any subset of N, Tc and Ti would fail to identify a taxon fully (but some subsets might be enough to answer specific needs). Using names only has been shown to be inappropriate. Using circumscription (Tc) only (or with names) remains incomplete and addresses the concept of the taxon (not the taxon!), and only part of the concept. It takes into account its usage (children taxa) and is approached by capturing the taxonomic literature. However, circumscription is not only about children taxa: each time a new reference is added, the concept of the taxon addressed also changes, because it encompasses all the biological attributes (i.e. taxon properties) associated with the specimens it groups. What Drosophila Fallén encompassed in 1830 is totally different from today, in terms of children taxa of course, but also in terms of its distribution, ecology, … We are referring to the same biological entity (the taxon) but no longer to the same concept. Tracking taxon name usage is not sufficient to formalize completely the taxon as a biological entity. Similarly, each time a taxon is moved in the classification, its definition (intension) changes (the topic of my paper): we have the same biological entity (the taxon) but not the same concept.

[And by the way: tracking all changes by circumscription (= tracking all occurrences) is no less enormous a task than tracking all changes by intension (= tracking all classification changes): if we agreed to do the first (= GBIF) we could also do the second.]

In other terms, for the usage of taxonomy we want/need a taxon identifier for the taxon as a biological entity, which is neither its name nor its concept in any of its definitions. But all of them are useful to represent part of the taxonomic knowledge in the digital sphere. As put in my paper, the Berendsohn notation "Aus bus Author, Date, sec. Author Date” (is Rich's ‘sensu’ a similar one?) remains for me the best way to identify a taxon clearly, even if 'sec Author Date’ itself represents a concept (a concept of classification) that is itself not static, is also hierarchic (sec Author Date in a higher system of classification sec Author2, Date), and evolves with progress in taxonomic/phylogenetic knowledge. A taxon identifier focussing on such a statement is probably the best solution we could have for the moment.

4. Species and higher levels. To meet the needs of the users of taxonomy, the focus is mainly put on the species level, which explains why the taxon concept by circumscription has been favored. Semantic shortcuts were made even in this discussion, using ‘species' instead of ‘taxon', and even Dave’s taxonomic story deals with species. The higher in the taxonomy the changes occur, the greater the consequences are for the species taxon. Even if species is just a rank like any other rank in taxonomy, and in theory no differences should occur, species-level-only reasoning might introduce some biases, and I think that the consequences of these higher changes need to be better investigated: I’m not fully sure whether new unexpected issues might occur.

BW, Th.

