general's Issues

Add official nomenclatural code lists

Both the zoological and botanical codes maintain lists of officially conserved, sanctioned, suppressed or rejected names. These official resources should be included in the nomenclature of the system.

Botany:
Appendices II–VIII

Also see:
http://www.iapt-taxon.org/historic/Congress/IBC_2017/foundation.pdf
https://www.koeltz.com/product.aspx?pid=208366

Zoology:
Darwin Core Archive of the Official Lists and Indexes of Names in Zoology
DwC-archive (few issues left): https://github.com/gbif/iczn-lists

Define scope much more aggressively

From the sidelines this project already seems huge, far too driven by legacy concerns, and lacking clear input from "users". Once again we see data providers having a stake but not the users. I'm not privy to the original discussion about this project, but it seems to me that it would be nice to have at least three things:

Names linked to evidence

Most catalogues of names have few if any links to the actual evidence for those names, i.e. the literature. Most of the existing links are not digital. In fact, I would invert this problem as not being one of names + links to literature, but literature annotated with (amongst other things) names. I suspect that both people and machines will gain more from access to the evidence. This is the intersection of BHL, the ever growing non-BHL digitised literature, and the nomenclators (why is ION not included, it has an order of magnitude more names than ZooBank?). For this to be effective, literature needs to be front and centre, the names are simply annotations. Let's stop rehashing 5x3 index cards. It's not the names that matter, it's the literature.

Names linked to names

Probably the single biggest frustration for users, and the one area I think this project should probably focus on. Synonyms drive people nuts. Botany does a good job of tracking objective synonyms (IPNI) and of linking to evidence for name changes (albeit mostly old-skool text strings), zoology doesn't, and nobody really tracks subjective synonyms (other than simply listing them, without supporting evidence). Lots of scope for text mining to help discover both synonyms and the evidence for them.

Names in a tree (or other navigation structure)

It's not entirely clear to me that we need yet another classification, especially one not based on evidence. Why not defer to the Open Tree of Life, which is notionally evidence based (e.g., phylogenies)? Leaving aside the classification/phylogeny distinction, lots of recent name changes will be driven by phylogenetic analyses. So why not delegate the classification to Open Tree? Is it not crazy and colossally wasteful for our field to have several projects all building all-encompassing taxonomic classifications?

Summary

Without trying to sound too cynical, we've been at this a while, and the same old issues keep coming around again and again. Doesn't this suggest that we're doing it wrong?

Why isn't this Wikispecies?

Playing Devil's advocate (AKA just being plain annoying), if we want an open, community-edited taxonomy, why don't we just use Wikispecies? Yes there are problems with Wikispecies, but:

  • it has active engagement with (at least some) taxonomists
  • it has better and more detailed referencing than many taxonomic databases
  • it is semi-structured
  • it is open

Given this, why not try and build upon that? What is the compelling argument for not doing so?

CoL+: first, do no harm

I haven't noticed yet any comments about poor data quality in CoL+. It's like the smudge on the kitchen window — you know it's there, you will get around to cleaning it someday but it isn't a priority job. So the window remains smudged.

The "data quality" issues I'm talking about have nothing to do with whether a name use is backed with evidence, or whether an author misspelled a name or whether the correct authority has been cited or whether a URL is correct. CoL, like GBIF and WoRMS and many other aggregations, is riddled with low-level errors, like invalid data items, character encoding failures, incorrect formatting, duplications and truncated data items. As I wrote in a recent email to sp2000, these errors render a very large number of records completely useless for digital processing and tediously difficult for human processing.

Low-level errors first appeared in the aggregations mainly because incoming data were either not audited at all or were not audited carefully enough. They're still in the aggregations because existing data are either not audited at all or are not audited carefully enough.

They will persist in CoL+ unless this project uses data migration as an opportunity to clean data. The likelihood of this happening, to judge from the overall workplan and some correspondence I've had, is close to zero.

What's worse is that CoL+ may reduce further the quality of existing data. The Hippocratic principle "primum non nocere", "first, do no harm", was ignored when backbone taxonomies first appeared, to the horror of taxonomists and collection specialists. CoL+ hopes to fix the damage with explicit linking of names.

But CoL (and other aggregators) are guilty of data-mangling at lower levels, and with no plan to check for further data-mangling with the migration to CoL+, it will happen. In addition to character encoding failures and inadequate checking for duplicates, a surprisingly common stuff-up is truncation. As an example, CoL inconsistently truncates authors. There's an absolute limit of 100 characters, but shorter strings have also been truncated. In at least two source databases I'm aware of, the truncated strings appear full-length.
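The low-level checks described above (encoding failures, truncation, duplicates) are cheap to automate. The following is a minimal sketch only, not anything CoL is known to run: the function names, the flag labels and the heuristics are assumptions made for illustration, and the only figure taken from the text is the 100-character limit.

```python
import re
from collections import Counter

# Signature of UTF-8 text that was wrongly decoded as Latin-1 (a crude heuristic).
MOJIBAKE = re.compile("[\u00c2\u00c3][\u0080-\u00bf]")
MAX_LEN = 100  # the absolute author-string limit mentioned in the text


def audit_author(author: str) -> list[str]:
    """Return a list of low-level problem flags for one authorship string."""
    flags = []
    # U+FFFD is the Unicode replacement character left behind by failed decoding.
    if "\ufffd" in author or MOJIBAKE.search(author):
        flags.append("encoding")
    # Strings at the length limit or with unbalanced parentheses were likely cut off.
    if len(author) >= MAX_LEN or author.count("(") != author.count(")"):
        flags.append("possibly truncated")
    if author != author.strip() or "  " in author:
        flags.append("whitespace")
    return flags


def duplicates(values: list[str]) -> list[str]:
    """Values that occur more than once after trivial normalisation."""
    counts = Counter(" ".join(v.split()).casefold() for v in values)
    return sorted(v for v, n in counts.items() if n > 1)
```

Checks like these only raise suspicion; strings truncated below the limit with balanced parentheses would still need a comparison against the source database.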

Every time I point out low-level data quality issues to aggregators I get the same answers (http://iphylo.blogspot.com.au/2016/04/guest-post-10-explanations-for-messy.html), which can be summarised as "We're aware of the problem, but..." Will CoL+ be any different?

Consequences of a single catalogue

Notes and discussion on the consequences of having a single Catalogue of Life dataset that distinguishes provisional information from scrutinized information.

Having both kinds of data integrated into a single dataset avoids potential inconsistencies between two catalogues and gives easier access to additional, provisional information if needed. It should be straightforward to offer a view on just the scrutinized content, a view on just the provisional content (e.g. for use in the review queues), and a full view.

A single dataset means:

  1. we need a unified workflow. CoL is manually edited with sectors being synchronized on demand. The integration of provisional data should be mostly automated and happen as quickly as possible whenever a source changes. There is great potential for editorial conflicts
  2. quick additions of new taxa e.g. via Plazi only happen on the draft catalogue, not released versions
  3. releases affect both and need to be coordinated. CoL still plans monthly editions while we had considered weekly ones for the provisional catalogue
  4. we need clear rules for how provisional data can augment existing taxa, and how provisional taxa and names can be inserted into existing, scrutinized sectors:
    • different orthographies for a name from a nomenclator and a taxonomic authority can be flagged but probably not automatically be applied
    • augment names with an authorship if missing
    • augment taxa with homotypic synonymy and basionym
    • fill classification gaps for missing ranks either because major ranks are unknown or source does not treat extended ranks like subfamily or tribe
    • augment names with original publication
    • augment vernacular names, e.g. more languages
    • augment with Plazi treatments
  5. stable taxon identifiers would have to be different for the 2 views if they are trying to capture taxon concepts based on synonymy and homotypic relations
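Two of the augmentation rules in item 4 (fill in a missing authorship or original publication, flag but do not apply a different orthography) could look roughly like this. It is a sketch under assumptions: `NameRecord`, its fields and `augment` are hypothetical names, not part of any existing CoL+ model.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class NameRecord:
    """Hypothetical, heavily simplified name record."""
    scientific_name: str
    authorship: Optional[str] = None
    published_in: Optional[str] = None
    scrutinized: bool = False


def augment(scrutinized: NameRecord, provisional: NameRecord):
    """Fill gaps in a scrutinized record; never overwrite scrutinized values."""
    issues = []
    updates = {}
    # A different orthography is flagged for review, not applied automatically.
    if scrutinized.scientific_name != provisional.scientific_name:
        issues.append("orthography differs: " + provisional.scientific_name)
    for attr in ("authorship", "published_in"):
        current = getattr(scrutinized, attr)
        offered = getattr(provisional, attr)
        if current is None and offered is not None:
            updates[attr] = offered  # augment only where data is missing
        elif current is not None and offered is not None and current != offered:
            issues.append(f"{attr} conflict: {offered}")
    return replace(scrutinized, **updates), issues
```

Returning the issues alongside the merged record keeps the editorial conflicts visible instead of resolving them silently.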

Discuss if stable taxon ids are sufficient to support a single, dynamic taxonomic catalogue

To improve sustainability of the CoL infrastructure we need to run a single system that exposes all older editions of the CoL. Instead of serving each entire annual edition online we should consider:

  • offering annual editions for download as sql dumps
  • serve a single, dynamic taxonomic catalogue with resolvable older taxon concepts not part of the current edition.
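As a rough illustration of the second option, a single catalogue could keep every taxon concept it has ever published and mark whether it is still part of the current edition. Everything below (class names, fields, the shape of the resolved record, the edition labels) is an assumption for the sake of the sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaxonConcept:
    taxon_id: str
    name: str
    first_edition: str
    last_edition: str  # last edition that contained this concept


class Catalogue:
    """Hypothetical single catalogue that also resolves retired concepts."""

    def __init__(self, current_edition: str):
        self.current_edition = current_edition
        self._concepts: dict[str, TaxonConcept] = {}

    def add(self, concept: TaxonConcept) -> None:
        self._concepts[concept.taxon_id] = concept

    def resolve(self, taxon_id: str) -> Optional[dict]:
        """Resolve any id ever issued; 'current' is False for retired concepts."""
        concept = self._concepts.get(taxon_id)
        if concept is None:
            return None
        return {
            "id": concept.taxon_id,
            "name": concept.name,
            "current": concept.last_edition == self.current_edition,
            "lastEdition": concept.last_edition,
        }
```

Whether identifiers alone are sufficient depends on how a concept is defined, which is the open question of this issue.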

Why I don't trust the Catalogue of Life

OK, a bit of a rant. From my perspective CoL is fundamentally broken. It asserts things with no supporting documentation, and favours "experts" over evidence. It assigns arbitrary "confidence" to sources.

To give a recent example, a user raised an issue on GBIF's bug tracking system, asking why two plant species that they thought were valid (and are listed as such by The Plant List and Plants of the World Online) are treated as synonyms by GBIF, see http://dev.gbif.org/issues/browse/PF-2856

Turns out these synonyms (Halenia hoppii as a synonym of Halenia weddeliana, and Gentianella dasyantha as a synonym of Gentianella selaginifolia) are provided by the Catalogue of Life, which in turn gets them from the “World Plants” database (?!), which doesn't seem to exist online (the http://worldplants.webarchiv.kit.edu/ URL given in the Catalogue of Life for this database doesn't list the “World Plants” database).

Who decided that some project that doesn't have a web presence is suddenly the authority on plant names?

This is why I'm so keen on getting names linked to literature. I want evidence for these assertions, not some arbitrary unsupported pronouncement by an "expert".

Allow pCat to override sCat

There seems to be the need in rare cases to be able to override decisions in the sCat by the pCat. To what degree should this be allowed and what needs to exist to handle this both from IT and governance aspects?

  • Can we modify the selected sCat plants taxonomy in pCat? How to respond to feedback like "This node of the higher tree is wrong and needs to be ..."
  • Who decides about the pCat content? A taxonomic board with CoL, GBIF, EOL, BOLD and BHL representatives? Or an even wider community, with clear rules about what to accept?
  • A change in the higher tree, like modifying plant families, will potentially have a huge impact on the consistency of the pCat taxonomy. Can we manage that?

Dataset types

Datasets differ in their focus and scope. The dataset metadata therefore currently offers a single vocabulary to classify them: https://github.com/Sp2000/colplus-backend/blob/master/colplus-api/src/main/java/org/col/api/vocab/DatasetType.java#L6

The dataset type can be important for human users to understand the scope of the dataset. It is also important for processing data correctly, especially when using the data to assemble merged catalogues such as the CoL and the provisional catalogue. For assembly it is vital to know whether a list of names is just names or indeed scrutinized taxa. By separating names from taxa in the ColDP format this is already apparent. Data published via DwC-A or ACEF does not make that distinction, so for those it is vital to know whether a dataset should be treated just as names or also as taxa.

The IPT BestPracticesChecklists resource already provides a good list of common dataset types:
https://github.com/gbif/ipt/wiki/BestPracticesChecklists#scope

Should we simply adopt it?
Is there any value in breaking down the dataset type classification into multiple properties? geographicScope could be an obvious candidate (none, regional, country, global)
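To make the question concrete, splitting the classification into separate properties might look like the sketch below. The four geographic scope values come from the suggestion above; `Content`, `DatasetProfile` and the method name are hypothetical and do not mirror the existing `DatasetType` vocabulary.

```python
from dataclasses import dataclass
from enum import Enum


class GeographicScope(Enum):
    NONE = "none"
    REGIONAL = "regional"
    COUNTRY = "country"
    GLOBAL = "global"


class Content(Enum):
    NAMES_ONLY = "names"  # treat records purely as names
    TAXA = "taxa"         # records are scrutinized taxa


@dataclass(frozen=True)
class DatasetProfile:
    """Hypothetical metadata profile with independent properties."""
    title: str
    content: Content
    geographic_scope: GeographicScope = GeographicScope.NONE

    def usable_for_taxonomy(self) -> bool:
        """Only datasets with scrutinized taxa should feed the taxonomic assembly."""
        return self.content is Content.TAXA
```

Independent properties avoid a combinatorial list of types such as "regional nomenclatural" versus "global taxonomic".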

Document complete DwC-A format

Review the current standard formats (DwC-A, the WFO profile, TCS) in order to define a simple but comprehensive exchange format, based on Darwin Core archives, that meets the requirements.

Normalized references and acts will need special attention.

Specimen data

Should the Clearinghouse of CoL+ deal with specimen data, especially types? It will cause significant work to curate and deal with specimen information so it should be clear what is gained.

What additional use cases can we support when having available specimen metadata as opposed to just knowing the basionym/protonym of a name acting as a proxy to the type? Or dynamically link to GBIF for a specimen (with image) search?

Is it important to know the catalogue number, collection, collector, type location or something else about a type?

Document API as RAML

All APIs of the project should be extensively documented for public and internal use.
The initial design and the continuous documentation updates will be done using the RESTful API Modeling Language (RAML).

Add a fossil geologic time range field

For palaeo taxa it is key to know the geologic time the organism was known to have lived, i.e. the range of geologic times it is known from the fossil record.

Implementation would best be based on a start and an end field (integer or double) giving the age in millions of years ago (Ma). Input could then also be parsed from named geological periods like "Trias, Juras", while the numeric fields would support all kinds of range searching.
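A minimal sketch of such a field pair with range searching follows. The period boundaries are rounded, approximate values included only as placeholders, the lookup uses English period names rather than the abbreviations quoted above, and all identifiers are hypothetical.

```python
from dataclasses import dataclass

# Rounded, approximate boundaries in Ma; placeholders for illustration only.
PERIODS = {
    "triassic": (252.0, 201.0),
    "jurassic": (201.0, 145.0),
    "cretaceous": (145.0, 66.0),
}


@dataclass(frozen=True)
class TimeRange:
    start_ma: float  # older bound, in millions of years ago
    end_ma: float    # younger bound, in millions of years ago

    def overlaps(self, other: "TimeRange") -> bool:
        """True if the two ranges share more than a single boundary point."""
        return self.start_ma > other.end_ma and other.start_ma > self.end_ma


def parse_range(text: str) -> TimeRange:
    """Parse 'Triassic, Jurassic' style input into one spanning range."""
    bounds = [PERIODS[part.strip().lower()] for part in text.split(",")]
    return TimeRange(max(b[0] for b in bounds), min(b[1] for b in bounds))
```

Storing plain numbers means a query such as "taxa known from 210 to 190 Ma" becomes a simple overlap test, regardless of how the range was originally entered.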

How to detect chresonyms?

Some resources, e.g. the Reptile DB, contain many chresonyms for a name, which the CoL would like to exclude. Manually flagging these names is very time consuming and not really feasible on this scale. What rules can we apply to discover the real name and flag chresonyms so they can be discarded in the assembly process?

Document the current GSD assembly process

Document the current GSD assembly process including the editorial process and processes across the multiple locations where assembly occurs. Document how IRMNG genera and regional lists get integrated

Manage a backlog for the Catalogue of Life product

Development of added functionality for the Catalogue of Life as a product may fall out of scope of the current 2 year CoL+ project. The information systems group, as advisory to the Catalogue of Life global team, could manage a backlog of items related to the Catalogue of Life product. Such an overview could help clarify what additional funding or in-kind contributions could be sought.

Activities may include:

  • firming up past backlog items
  • creating clarity what is out of scope for the clearinghouse infrastructure for names and taxonomy presently in the 2 year CoL+ project
  • logging items to be discussed in the CoL governance

Should unavailable names be classified in the provisional Catalogue

Should unavailable names in the provisional catalogue be linked to the taxonomic tree in some way? Even though they are not proper names it is probably helpful to attach some taxonomic context to them. In many cases they can probably just be treated as synonyms, but for others there is no clear accepted name. But it should be possible to link them at least to a genus, family or other higher group to give them some context. If classic synonyms are not applicable do we need a new taxonomic status for such names?

For example 3 Abies names, one invalid, one illegitimate and one valid but all considered to be within Pinaceae: http://www.tropicos.org/Name/50231084?tab=homonyms
These can probably be treated as synonyms.

CoL name rendering rules

How should a scientific name be rendered in the Catalogue of Life?
There are different code recommendations for botanical, zoological, bacterial and virus names. Practice also differs considerably between major zoological groups and data providers.

Document the desirable format and rules that should always be applied for each code.
To be considered:

  • infraspecific rank marker standardization
  • infrageneric names
  • subgenus classification of species
  • authorships
    • combination & basionym authors
    • years
    • full vs abbreviated authors
    • full authors with initials before or after lastname, avoid commas
    • et al.
    • and vs &

What management classification to use for the CoL

The current CoL uses a higher classification that evolved over time and is in many groups defined by the respective GSD. This causes for example all algae classes and even phyla to disappear when Algaebase was removed.

It is also unclear what relationship the classification in use in the CoL has to the officially published classification papers by Ruggiero et al. They do not match up.

Discuss the desired future management classification of the CoL.

  1. Should it follow the Ruggiero paper?
  2. Should it be managed continuously by some advisor group similar to a GSD?
  3. Should it go down to family/order level in all groups or should GSDs be able to define the classification?

See also the official information from CoL:
http://www.catalogueoflife.org/annual-checklist/2017/info/hierarchy

And the related issue #30 about which (higher) ranks the CoL should include.

Nomenclatural equality for monomials at different ranks

Suprageneric monomials in particular can be used at various ranks.

  • Should these be considered different names in the nomenclatural sense or is that merely a different taxonomic usage of the same name?
  • Is there a difference between botany and zoology?
  • Is it sufficient to distinguish family group names from genus group names and ignore the exact rank, considering we also want to track unavailable names?

For example Dianthera exists twice as a genus in IPNI:
http://beta.ipni.org/?f=f_generic&q=Dianthera

It also exists both as a section and as a subsection, both from the same publication but apparently on different pages: http://beta.ipni.org/?f=f_infrageneric&q=Dianthera

So this clearly calls for unique ranks in subgeneric names.
I presume we have the same case for suprageneric names.

Define rules for scientific name identity

Define clear rules what exactly makes a name the same name. A scientific name in the sense of the Clearinghouse of CoL+ has an identity and a unique, stable identifier. If possible these identifiers should be reusing ids issued by the participating nomenclators like IPNI.


A name includes its authorship. Two homonyms with different authors therefore represent two different name entities.

The same name can usually be represented by many different strings which we refer to as lexical variations. For each name a standard representation, the canonical form, exists. Lexical variations exist for various reasons. Author spelling, transliterations, epithet gender, additional infrageneric or infraspecific indications or cited species authors in infraspecific names are common reasons. Listed here are 7 distinct names with some of their string representations:

 1. Aus bus Linnaeus 1758
    - Aus bus Linn. 1758
    - Aus bus Linn 1758
    - Aus bus L.
    - Aus ba Linn 1758.
    - Aus (Hus) bus L.

 2. Xus bus (Linn, 1758)
    - Xus bus (Linn) Smith

 3. Xus cus Smith, 1850
    - Xus cus Sm.

 4. Xus cus Jones 1900

 5. Xus bus cus Smith 1850
    - Xus bus subsp. cus Smith 1850

 6. Xus dus Pyle 2000

 7. Foo bar var. lion Smith 1850
    - Foo bar L. var. lion Smith
    - Foo bar subsp. dar var. lion Smith 1850
    - Foo bar Lin. subsp. dar Mill. var. lion Smith 1850

New names (sp./gen. nov.), new recombinations of the same epithet (comb. nov.), a name at a new rank (stat. nov.) or replacement names (nom. nov.) are all treated as distinct names.
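To illustrate how lexical variations like those listed above might be grouped, here is a deliberately naive sketch that reduces a name string to genus, species epithet and terminal infraspecific epithet. The function name, the rank marker set and the token heuristics are assumptions. It ignores authorship (which still has to be compared to tell homonyms apart), does not handle gender variants such as "Aus ba", and would misread lowercase author particles.

```python
# Rank markers that may appear between epithets, without their trailing dot.
RANK_MARKERS = {"subsp", "ssp", "var", "subvar", "f", "forma"}


def canonical_form(name: str) -> str:
    """Naive canonical form: genus, species epithet, terminal epithet."""
    tokens = name.split()
    genus, rest = tokens[0], tokens[1:]
    epithets = []
    for tok in rest:
        if tok.startswith("("):
            continue  # subgenus or start of a parenthetical authorship
        if tok.rstrip(".") in RANK_MARKERS:
            continue  # rank marker such as "subsp." or "var."
        if tok.isalpha() and tok.islower():
            epithets.append(tok)  # capitalised or punctuated tokens are authorship
    if len(epithets) > 2:
        # Drop intermediate infraspecific epithets, keep the terminal one.
        epithets = [epithets[0], epithets[-1]]
    return " ".join([genus] + epithets)
```

Strings sharing a canonical form are only candidates for being the same name; the authorship comparison and the open questions below decide the rest.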

Open questions to be addressed:

  • How to treat various spelling variations. Should (some) misspellings, different transliterations, ligatures, umlauts or a wrong gender ending be considered a different name or just a lexical variant?
  • Is the (intended) publication of a name a requirement?
  • What about chresonyms?
  • Is an ambiregnal name published both under the botanical and zoological code a single or two names?

We need to capture examples of the various cases.

Suggest and review nomenclatural sources

Initial imports from sources relevant to nomenclature should be considered. Please suggest relevant nomenclatural sources as comments in this issue. Key information is the name with authorship, the literature reference ideally with a DOI or link, type material and the basionym/protonym information

Relevant sources that should be synced continuously:

Other potential sources

Is common name transliteration needed?

The current CoL and the ACEF format define a transliteration property for a common name. It is likely intended for non-Latin scripts like Chinese. Is this truly useful information that should be kept in CoL+? There also seems to be a lot of wrong data in there, e.g. English common names given as a translation of a vernacular name (see the Dutch name below).

Example: http://www.catalogueoflife.org/annual-checklist/2017/details/species/id/13ea698ddd3fe9844666f634ece60a23/common/92051241d63cfbc89630c6a3ce9bb508

Common name | Transliteration | Language | Country
شبّوط أسود | - | Arabic | -
Zwarte carper | Black carp | Dutch | Netherlands
湄公雙孔魚 | Mei gong shuang kong yu | Mandarin Chinese | China

Decide on domain(s)

The nomenclator, taxonomic catalogue and editor ultimately need a domain to live at.
The current catalogueoflife.org should preferably not be overloaded.

Options are to run a col.plus domain for now that can be used for all components, e.g.:

  • names.col.plus
  • api.col.plus
  • www.col.plus for the catalog portal

Or we use different domains for the unified nomenclator and the taxonomic pieces.
Posting ideas as separate comments so we can count likes individually

How to deal with literature citations

Literature plays a crucial role especially in the nomenclator. This issue follows up on #22 and wants to discuss how to:

  1. model references
  2. parse references
  3. retrieve a DOI or other identifier
  4. establish links to view the actual text
  5. and list valuable sources for literature of names

References primarily exist to allow users to trace back where information originally came from. The ideal situation would allow the actual text to be viewed so the information can be verified and/or additional information extracted.

I would even argue that links to the actual text (whether this is a DOI to an (open access) journal, a plain URL to a webpage, a PDF or BHL pages) are the key here, not so much the reference metadata. The year of publication is very useful for presenting data in chronological order and important for the nomenclatural codes. But further parsing and even normalization of references does not seem to bring much benefit. Finding a DOI or BHL page probably requires some degree of parsing, but is that the only reason?

OTOL as an application infrastructure for CoLplus

I completely concur with the ideas re OTOL from @rdmpage. It seems to me that a vast majority of functionality (data modelling, tooling, etc.) could re-use OTOL's efforts, which I consider to be very well implemented as these things go. The underlying core data structures should be the same (graphs, studies). OTOL already has means to accession studies and graphs (= GSDs or individual databases and their classifications/checklists). OTOL has tools like the OTU mapper (potentially extractable - https://github.com/SpeciesFileGroup/otu_mapping_widget), which is a basic component of an editorial (combining GSDs) interface, i.e. mapping known to unknown. Substitute branch lengths for NOMEN relationships (or something similar, I'm just using it as an example), and you have most of the expressiveness you need for the nomenclator (facts, not opinions). OTOL's collections are the obvious basis/proxy for opinions (taxon concepts).

Distribution data in the CoL

How should the CoL deal with distribution data in the future?
Currently distribution data is:

  1. not present in all groups
  2. often covered only regionally for a given species rather than globally, with no indication of whether the distribution should be considered complete
  3. sometimes structured sometimes just plain text

Discuss whether we simply continue as is or:

  1. only allow structured distributions with a distinct record for each area (according to various extendible gazetteers, e.g. see http://rs.gbif.org/areas/). This allows presentations and queries as maps and interchange with the World Flora Online
  2. require some indication of whether the given coverage is considered globally complete
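The two options above could be combined in a structure like the following sketch. The class and field names are hypothetical, and the "iso" gazetteer label with ISO 3166-1 alpha-2 codes is just one example of an area vocabulary.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Distribution:
    gazetteer: str  # e.g. "iso" for ISO 3166-1 alpha-2 country codes
    area: str       # area code within that gazetteer
    status: str = "native"


@dataclass
class TaxonDistribution:
    taxon_id: str
    globally_complete: bool = False  # explicit statement about coverage
    records: list[Distribution] = field(default_factory=list)

    def occurs_in(self, gazetteer: str, area: str) -> Optional[bool]:
        """True or False when known; None when absence cannot be concluded."""
        if any(r.gazetteer == gazetteer and r.area == area for r in self.records):
            return True
        # Absence is only meaningful for complete coverage in the same gazetteer.
        same_vocabulary = all(r.gazetteer == gazetteer for r in self.records)
        return False if self.globally_complete and same_vocabulary else None
```

The completeness flag is what turns a missing record from "unknown" into "absent", which plain-text distributions cannot express.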

Consider defining CoL plus simply as a set of API endpoints

Talking with others the idea came up to have the project proceed as follows:

  1. Interested parties define API endpoints. We do this with RAML, or simply add an example request as an issue header and include the expected JSON/YAML response in the body. This puts the screws on folks like @rdmpage, @yroskov etc. to show exactly the types of requests they want.

  2. The API endpoints are aggregated by @mdoering or other technical folk.

  3. Everyone tags their favourite endpoints.

  4. Sets of endpoints become proxies for "Apps" (like CoL plus).

  5. Implementation begins. The API acts as unit tests (critical for this effort).

  6. For each set of "favourited" endpoints we have a Travis CI call that shows how many unit tests pass. When @yroskov and the CoL team's set of API endpoints pass (for example), then and only then, are GUIs, interfaces, etc. written.
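The "endpoints as unit tests" idea in the steps above can be sketched as data plus a small checker. All names here (`EndpointSpec`, `passes`, `pass_rate`, the example paths and field names) are made up for illustration; no actual CoL+ endpoint is implied.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class EndpointSpec:
    """One requested endpoint: an example request and the expected response."""
    requested_by: str
    method: str
    path: str
    expected: dict[str, Any]


def passes(spec: EndpointSpec, call: Callable[[str, str], dict]) -> bool:
    """A spec passes when every expected key/value appears in the response."""
    response = call(spec.method, spec.path)
    return all(response.get(k) == v for k, v in spec.expected.items())


def pass_rate(specs: list[EndpointSpec], call: Callable[[str, str], dict]) -> float:
    """Fraction of a set of favourited specs that the implementation satisfies."""
    results = [passes(s, call) for s in specs]
    return sum(results) / len(results) if results else 0.0
```

In a CI job, `call` would wrap a real HTTP client against the deployed API, and the pass rate per set of specs is the number reported in step 6.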

Establish taxonomic API

A new taxonomic API should enable:

  • external use of the taxonomic catalogue, i.e. searching & browsing
  • resolve taxon ids, even if historic
  • taxon concept matching assigning a taxon ID based on rules per #6

Evaluate best way to migrate the PHP portal

Evaluate what’s the best way forward to migrate the existing CoL portal to the new infrastructure.
The aim is to find the least amount of resources needed to keep essential features of the current portal working on the new infrastructure, i.e. the db model or webservices. Document key requirements for a new portal, including:

  • resolution of taxon IDs (can we drop LSIDs?), see #6
  • serving data of historic annual editions, see #12
  • existing URLs to be kept
  • is internationalisation truly needed from the beginning? see #14

Based on that there are 2 main options to consider:

  1. Update the current PHP portal code to
    • Use the new database model for SQL queries or, better, the new API
    • Deal with deleted taxa in portal
    • Resolve stable taxon ids
    • Show and mark provisional data
  2. Rewrite the portal in the same JS framework used for the Nomenclator
    • Reuse the existing portal url layout, html & css to save efforts
    • consistently use new API
    • Consider if internationalisation is needed

Extend CoL ranks

The current CoL uses a fixed set of ranks that limits its use and misses some important groups.
It is suggested to extend the ranks to cover the following, bold=existing CoL ranks:

kingdom
phylum
subphylum
class
subclass
order
suborder
superfamily
family
subfamily
tribe
genus
subgenus

It is not required to use all ranks in every group, but use tribes and subfamilies for example as appears useful within the respective group.

See also the following issues in GBIF as a background:
http://dev.gbif.org/issues/browse/POR-2781
http://dev.gbif.org/issues/browse/POR-325

How does CoL handle taxonomic remarks in names

What is the policy for names in CoL when it comes to taxonomic remarks such as "non Author X", "sensu latu", "s.str.", "auct amer", "sensu X" or any other annotation found in names that indicates a taxon concept to some degree and is not a proper name in the strict nomenclatural sense?

For example there are these 2 names in the CoL:

  • accepted: Abalistes stellaris (Bloch & Schneider, 1801)
  • synonym: Abalistes stellaris (non Bloch & Schneider, 1801)

Should these both exist or the synonym be filtered out?
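If such names were to be filtered or flagged, a first pass could be a simple pattern match on the annotations quoted above. This is a sketch only: the pattern covers just "non", "sensu", "auct" and "s.str.", the function name is hypothetical, and what to do with a flagged name remains the policy question.

```python
import re
from typing import Optional

# Remarks taken from the examples in the text; lookarounds keep whole-word matches.
REMARK = re.compile(r"(?<![A-Za-z])(non|sensu|auct\.?|s\.\s?str\.)(?![A-Za-z])")


def taxonomic_remark(name: str) -> Optional[str]:
    """Return the first taxonomic remark found in a name string, if any."""
    match = REMARK.search(name)
    return match.group(1) if match else None
```

Applied to the two Abalistes stellaris strings above, only the synonym with "non" is flagged.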

Organise names using the CoL hierarchy

We need something that organises the nomenclatural data if there is no taxonomic scrutiny available. The CoL hierarchy reaches down to families and, with the provisional addition of IRMNG, even to most genera. Try to be the user of our own product and organize names in the nomenclator provisionally using the latest CoL.
