The Catalogue of Life
There are 2 Drupal-based sites that GBIF would prefer to be moved to a static site or similar:
http://sp2000.org/
http://www.catalogueoflife.org/
GBIF has just replaced its own Drupal environment with static sites and/or outsourced the CMS to https://www.contentful.com/.
Consider if:
A plan will be developed to frame governance, content solicitation from new sources, and rules governing the editorial process.
CoL gets feedback from users with evidence that a taxon actually is extinct, e.g. the Dodo.
Should there be an editorial decision to change a taxon flag persistently across source updates?
Both the zoological and botanical codes maintain lists of the officially conserved, sanctioned, suppressed or rejected names. These official resources should be included in the nomenclature of the system.
Botany:
Appendices II–VIII
Also see:
http://www.iapt-taxon.org/historic/Congress/IBC_2017/foundation.pdf
https://www.koeltz.com/product.aspx?pid=208366
Zoology:
Darwin Core Archive of the Official Lists and Indexes of Names in Zoology
DwC-archive (few issues left): https://github.com/gbif/iczn-lists
From the sidelines this project already seems huge, far too driven by legacy concerns, and lacking clear input from "users". Once again we see data providers having a stake but not the users. I'm not privy to the original discussion about this project, but it seems to me that it would be nice to have at least three things:
Most catalogues of names have few if any links to the actual evidence for those names, i.e. the literature. Most of the existing links are not digital. In fact, I would invert this problem as not being one of names + links to literature, but literature annotated with (amongst other things) names. I suspect that both people and machines will gain more from access to the evidence. This is the intersection of BHL, the ever-growing non-BHL digitised literature, and the nomenclators (why is ION not included? It has an order of magnitude more names than ZooBank). For this to be effective, literature needs to be front and centre; the names are simply annotations. Let's stop rehashing 5x3 index cards. It's not the names that matter, it's the literature.
Probably the single biggest frustration for users, and the one area I think this project should probably focus on. Synonyms drive people nuts. Botany does a good job of tracking objective synonyms (IPNI) and linking to the evidence for name changes (albeit mostly old-skool text strings); zoology doesn't, and nobody really tracks subjective synonyms (other than simply listing them, without supporting evidence). Lots of scope for text mining to help discover both synonyms and the evidence for them.
It's not entirely clear to me that we need **yet another** classification, especially one not based on evidence. Why not defer to the Open Tree of Life, which is notionally evidence based (e.g., phylogenies)? Leaving aside the classification/phylogeny distinction, lots of recent name changes will be driven by phylogenetic analyses. So why not delegate the classification to Open Tree? Is it not crazy and colossally wasteful for our field to have several projects all building all-encompassing taxonomic classifications?
Without trying to sound too cynical, we've been at this a while, and the same old issues keep coming around again and again. Doesn't this suggest that we're doing it wrong?
Playing Devil's advocate (AKA just being plain annoying), if we want an open, community-edited taxonomy, why don't we just use Wikispecies? Yes there are problems with Wikispecies, but:
Given this, why not try and build upon that? What is the compelling argument for not doing so?
Currently an annual DVD of the CoL checklist is produced, based on the monthly edition of March each year. The current process is that the data is downloaded from the 'production' database; after this download, creating the DVD is a manual process. The annual checklist is available through the Catalogue of Life portal: http://www.catalogueoflife.org/annual-checklist/2017/
I haven't noticed yet any comments about poor data quality in CoL+. It's like the smudge on the kitchen window — you know it's there, you will get around to cleaning it someday but it isn't a priority job. So the window remains smudged.
The "data quality" issues I'm talking about have nothing to do with whether a name use is backed with evidence, or whether an author misspelled a name or whether the correct authority has been cited or whether a URL is correct. CoL, like GBIF and WoRMS and many other aggregations, is riddled with low-level errors, like invalid data items, character encoding failures, incorrect formatting, duplications and truncated data items. As I wrote in a recent email to sp2000, these errors render a very large number of records completely useless for digital processing and tediously difficult for human processing.
Low-level errors first appeared in the aggregations mainly because incoming data were either not audited at all or were not audited carefully enough. They're still in the aggregations because existing data are either not audited at all or are not audited carefully enough.
They will persist in CoL+ unless this project uses data migration as an opportunity to clean data. The likelihood of this happening, to judge from the overall workplan and some correspondence I've had, is close to zero.
What's worse is that CoL+ may reduce further the quality of existing data. The Hippocratic principle "primum non nocere", "first, do no harm", was ignored when backbone taxonomies first appeared, to the horror of taxonomists and collection specialists. CoL+ hopes to fix the damage with explicit linking of names.
But CoL (and other aggregators) are guilty of data-mangling at lower levels, and with no plan to check for further data-mangling with the migration to CoL+, it will happen. In addition to character encoding failures and inadequate checking for duplicates, a surprisingly common stuff-up is truncation. As an example, CoL inconsistently truncates authors. There's an absolute limit of 100 characters, but shorter strings have also been truncated. In at least two source databases I'm aware of, the strings that CoL truncates appear at full length.
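The low-level problems listed above (encoding failures, untrimmed or control characters, hard truncation at a length limit) are all mechanically detectable. A minimal audit sketch, where the function name, the specific checks and the 100-character limit are illustrative rather than actual CoL code:

```python
def audit_value(value: str, max_len: int = 100) -> list:
    """Return a list of low-level quality flags for a single data item.

    The checks mirror the error classes described in the text; the
    "Ã" test is a crude mojibake heuristic and can false-positive on
    legitimate characters, so real rules would need tuning per source.
    """
    issues = []
    if "\ufffd" in value or "Ã" in value:       # typical UTF-8/Latin-1 mangling
        issues.append("encoding")
    if any(ord(c) < 32 for c in value):         # stray control characters
        issues.append("control-char")
    if value != value.strip():
        issues.append("untrimmed-whitespace")
    if len(value) >= max_len:                   # at the hard truncation limit
        issues.append("possible-truncation")
    return issues
```

Running such checks during the CoL+ migration would at least flag the records that are currently useless for digital processing.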
Every time I point out low-level data quality issues to aggregators I get the same answers (http://iphylo.blogspot.com.au/2016/04/guest-post-10-explanations-for-messy.html), which can be summarised as "We're aware of the problem, but..." Will CoL+ be any different?
Notes and discussion on the consequences of having a single Catalogue of Life dataset that distinguishes provisional information from the scrutinized one.
Having both kinds of data integrated into a single dataset avoids potential inconsistencies between two catalogues and gives easier access to additional, provisional information when needed. It should be straightforward to offer views on just the scrutinized content, just the provisional content (e.g. for use in the review queues), and a full view.
A single dataset means:
To improve sustainability of the CoL infrastructure we need to run a single system that exposes all older editions of the CoL. Instead of serving each entire annual edition online we should consider:
OK, a bit of a rant. From my perspective CoL is fundamentally broken. It asserts things with no supporting documentation, and favours "experts" over evidence. It assigns arbitrary "confidence" to sources.
To give a recent example, a user raised an issue on GBIF's bug tracking system, asking why two plant species that they thought were valid (and are listed as such by The Plant List and Plants of the World Online) are treated as synonyms by GBIF, see http://dev.gbif.org/issues/browse/PF-2856
Turns out these synonyms (Halenia hoppii as a synonym of Halenia weddeliana, and Gentianella dasyantha as a synonym of Gentianella selaginifolia) are provided by the Catalogue of Life, which in turn gets them from the “World Plants” database (?!), which doesn't seem to exist online (the http://worldplants.webarchiv.kit.edu/ URL given in the Catalogue of Life for this database doesn't list the “World Plants” database).
Who decided that some project that doesn't have a web presence is suddenly the authority on plant names?
This is why I'm so keen on getting names linked to literature. I want evidence for these assertions, not some arbitrary unsupported pronouncement by an "expert".
There seems to be the need in rare cases to be able to override decisions in the sCat (scrutinized catalogue) by the pCat (provisional catalogue). To what degree should this be allowed, and what needs to exist to handle this from both IT and governance aspects?
Datasets differ in their focus and scope. The dataset metadata therefore currently offers a single vocabulary to classify them: https://github.com/Sp2000/colplus-backend/blob/master/colplus-api/src/main/java/org/col/api/vocab/DatasetType.java#L6
The dataset type can be important for human users to understand the scope of the dataset. It is also important for processing data correctly, especially when using the data to assemble merged catalogues such as the CoL and the provisional catalogue. For the assembly purpose it is vital to know if a list of names is just names or indeed scrutinized taxa. By separating names from taxa in the ColDP format this is already apparent. Data published via DwC-A or ACEF does not have that distinction, though, so for those it is vital to know whether a dataset should be treated as just names or also as taxa.
The IPT BestPracticesChecklists resource already provides a good list of common dataset types:
https://github.com/gbif/ipt/wiki/BestPracticesChecklists#scope
Should we simply adopt it?
Is there any value in breaking down the dataset type classification into multiple properties? geographicScope could be an obvious candidate (none, regional, country, global).
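Breaking the single type vocabulary into orthogonal properties could look like the following sketch. The `GeographicScope` values come from the question above; the second axis (`ContentType`, distinguishing names-only lists from scrutinized taxa) is a hypothetical example based on the assembly needs described earlier, not an existing CoL+ vocabulary:

```python
from dataclasses import dataclass
from enum import Enum, auto

class GeographicScope(Enum):
    NONE = auto()
    REGIONAL = auto()
    COUNTRY = auto()
    GLOBAL = auto()

class ContentType(Enum):
    """Hypothetical axis: does the dataset carry taxa or just names?"""
    NAMES_ONLY = auto()
    TAXA = auto()

@dataclass
class DatasetClassification:
    geographic_scope: GeographicScope
    content: ContentType
```

A multi-property breakdown like this lets the assembly process query, say, all global taxon datasets without overloading one flat type enum.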
See MODEL for current state
Document the essential current webservices and their usage in order to assess consequences of deprecation. Decide whether the existing list matching and 4D4life services need to be migrated.
Review the current standard formats (DwC-A, WFO profile, TCS) in order to define a simple but comprehensive exchange format based on DwC archives that meets requirements.
Normalized references and acts will need special attention
Discuss if internationalisation of the CoL portal is crucial from the start or if this can be added at a later stage, post CoL+. Document current portal URLs to be kept in a potential new implementation.
Should the Clearinghouse of CoL+ deal with specimen data, especially types? It will cause significant work to curate and deal with specimen information so it should be clear what is gained.
What additional use cases can we support when specimen metadata is available, as opposed to just knowing the basionym/protonym of a name acting as a proxy to the type? Or should we dynamically link to GBIF for a specimen (with image) search?
Is it important to know the catalogue number, collection, collector, type location or something else about a type?
We would like to expose all previous annual releases of the CoL and need to get hold of the data.
The annual releases since 2000 are offered for download, but they come bundled with the application and use a binary MySQL distribution which is not very portable. Unless SQL dumps already exist, these should be converted to portable SQL dumps.
http://www.catalogueoflife.org/content/annual-checklist-archive
All APIs of the projects should be extensively documented for public and internal use.
The initial design and the continuous documentation updates will be done using the RESTful API Modeling Language (RAML).
For palaeo taxa it is key to know the geologic time the organism was known to have lived, i.e. the range of geologic times it is known from the fossil record.
Implementation would best be based on a start and an end field (integer or double) representing million years (Ma). Input could then also be parsed from known geological times like "Trias, Juras", and this representation allows offering all kinds of range searches.
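A minimal sketch of this idea, assuming a start/end pair in Ma where the start is the older bound. The named-period table holds rounded bounds for two periods only, and the prefix matching (so that "Trias" resolves to Triassic) is an illustrative parsing choice, not a proposed implementation:

```python
from dataclasses import dataclass

@dataclass
class GeoTimeRange:
    start_ma: float  # older bound, in million years before present
    end_ma: float    # younger bound

    def overlaps(self, other: "GeoTimeRange") -> bool:
        # two [end_ma, start_ma] intervals on the Ma axis intersect
        return self.start_ma >= other.end_ma and other.start_ma >= self.end_ma

# Rounded bounds for a couple of periods; a real table would cover the
# full geological timescale.
NAMED_TIMES = {
    "triassic": GeoTimeRange(251.9, 201.3),
    "jurassic": GeoTimeRange(201.3, 145.0),
}

def lookup(token: str):
    token = token.strip().lower()
    if not token:
        return None
    for name, rng in NAMED_TIMES.items():
        if name.startswith(token):  # accepts abbreviations like "Trias"
            return rng
    return None

def parse_geo_time(text: str) -> GeoTimeRange:
    """Parse e.g. "Trias, Juras" into one combined start/end range."""
    ranges = [r for r in (lookup(t) for t in text.split(",")) if r]
    if not ranges:
        raise ValueError(f"unknown geological time: {text!r}")
    return GeoTimeRange(max(r.start_ma for r in ranges),
                        min(r.end_ma for r in ranges))
```

With the two numeric fields in place, range searches reduce to simple interval-overlap comparisons in the database.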
Some resources, e.g. the Reptile DB, contain many chresonyms for a name which the CoL would like to exclude. Manually flagging these names is very time consuming and not really feasible at this scale. What rules can we apply to discover the real name and flag chresonyms to discard them in the assembly process?
The RAML API specs are rendered as static html pages using https://github.com/mdoering/raml-slate which are then hosted at https://api.col.plus/
Pilot integration of an external taxonomic editor with existing (willing) partner, e.g. http://taxonworks.org/
Piloting could be used to refine the taxonomic API used for integration
Document the current GSD assembly process including the editorial process and processes across the multiple locations where assembly occurs. Document how IRMNG genera and regional lists get integrated
Development of added functionality for the Catalogue of Life as a product may fall out of scope of the current 2-year CoL+ project. The information systems group, as advisory to the Catalogue of Life global team, could manage a backlog of items related to the Catalogue of Life product. Such an overview could help provide clarity on what additional funding/in-kind contributions could be sought.
Activities may include:
Things like "geological period" and "synonym" don't belong together at the same level, for example.
@dimus has a nice three tiered model that he defines as a way to organize data. I think things need to be crystal clear there.
Should unavailable names in the provisional catalogue be linked to the taxonomic tree in some way? Even though they are not proper names it is probably helpful to attach some taxonomic context to them. In many cases they can probably just be treated as synonyms, but for others there is no clear accepted name. But it should be possible to link them at least to a genus, family or other higher group to give them some context. If classic synonyms are not applicable do we need a new taxonomic status for such names?
For example 3 Abies names, one invalid, one illegitimate and one valid but all considered to be within Pinaceae: http://www.tropicos.org/Name/50231084?tab=homonyms
These can probably be treated as synonyms.
How should a scientific name be rendered in the Catalogue of Life?
There are different code recommendations for botanical, zoological, bacterial & virus names. There is also considerably different practice between major zoological groups and data providers.
Document the desirable format and rules that should always be applied for each code.
To be considered:
managed by VLIZ
http://www.vliz.be/en/catalogue?module=ref&show=955
The current CoL uses a higher classification that evolved over time and is in many groups defined by the respective GSD. This caused, for example, all algae classes and even phyla to disappear when AlgaeBase was removed.
It is also unclear what relationship the classification in use in the CoL has to the officially published classification papers by Ruggiero et al.; they do not match up.
Discuss the desired future management classification of the CoL.
See also the official information from CoL:
http://www.catalogueoflife.org/annual-checklist/2017/info/hierarchy
And the related issue #30 about which (higher) ranks the CoL should include.
Suprageneric monomials in particular can be used at various ranks.
For example Dianthera
exists twice as a genus in IPNI:
http://beta.ipni.org/?f=f_generic&q=Dianthera
And as both a section and a subsection, both from the same publication but apparently different pages: http://beta.ipni.org/?f=f_infrageneric&q=Dianthera
So this clearly demands unique ranks in subgeneric names.
I presume we have the same case for suprageneric names.
Define clear rules what exactly makes a name the same name. A scientific name in the sense of the Clearinghouse of CoL+ has an identity and a unique, stable identifier. If possible these identifiers should be reusing ids issued by the participating nomenclators like IPNI.
A name includes its authorship. Two homonyms with different authors therefore represent two different name entities.
The same name can usually be represented by many different strings, which we refer to as lexical variations. For each name a standard representation, the canonical form, exists. Lexical variations exist for various reasons: author spelling, transliterations, epithet gender, additional infrageneric or infraspecific indications, or cited species authors in infraspecific names are common ones. Listed here are 7 distinct names with some of their string representations:
1. Aus bus Linnaeus 1758
- Aus bus Linn. 1758
- Aus bus Linn 1758
- Aus bus L.
- Aus ba Linn 1758.
- Aus (Hus) bus L.
2. Xus bus (Linn, 1758)
- Xus bus (Linn) Smith
3. Xus cus Smith, 1850
- Xus cus Sm.
4. Xus cus Jones 1900
5. Xus bus cus Smith 1850
- Xus bus subsp. cus Smith 1850
6. Xus dus Pyle 2000
7. Foo bar var. lion Smith 1850
- Foo bar L. var. lion Smith
- Foo bar subsp. dar var. lion Smith 1850
- Foo bar Lin. subsp. dar Mill. var. lion Smith 1850
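A naive sketch of reducing such strings to a canonical form, to make the idea concrete. This handles only some of the variation classes above: it drops authors, years and parenthesized insertions, but does NOT normalize epithet gender ("bus" vs "ba") or collapse intermediate infraspecific epithets; a real implementation needs a full name parser. The function and rank-marker list are illustrative assumptions:

```python
import re

RANK_MARKERS = {"subsp.", "ssp.", "var.", "subvar.", "f.", "subf."}

def canonical(name: str) -> str:
    """Naive canonical form: genus + rank markers + lowercase epithets."""
    # drop anything in parentheses (infrageneric names or basionym authors)
    name = re.sub(r"\([^)]*\)", " ", name)
    tokens = name.split()
    keep = [tokens[0]]                     # the capitalized genus
    for t in tokens[1:]:
        if t in RANK_MARKERS or t.islower():
            keep.append(t)                 # epithets and rank markers
        # everything else (author abbreviations, years) is skipped
    return " ".join(keep)
```

Even this crude reduction maps several of the listed variations of name 1 onto the same string, which shows why canonical forms are useful for matching despite needing much more machinery for the hard cases.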
New names (sp./gen. nov.), new recombinations of the same epithet (comb. nov.), a name at a new rank (stat. nov.) or replacement names (nom. nov.) are all treated as distinct names.
Open questions to be addressed:
We need to capture examples of the various cases.
Initial imports from sources relevant to nomenclature should be considered. Please suggest relevant nomenclatural sources as comments in this issue. Key information is the name with authorship, the literature reference ideally with a DOI or link, type material and the basionym/protonym information
Relevant sources that should be synced continously:
Other potential sources
The current CoL and the ACEF format define a transliteration property for a common name. It is likely to be used for non-Latin scripts like Chinese. Is this truly useful information that should be kept in CoL+? There also seems to be a lot of wrong data in there, e.g. English common names given as the transliteration of a vernacular name (see the Dutch name below).
Common name | Transliteration | Language | Country |
---|---|---|---|
شبّوط أسود | - | Arabic | - |
Zwarte carper | Black carp | Dutch | Netherlands |
湄公雙孔魚 | Mei gong shuang kong yu | Mandarin Chinese | China |
The nomenclator, taxonomic catalogue and editor ultimately need a domain to live at.
The current catalogueoflife.org should preferably not be overloaded.
Options are to run a col.plus domain for now that can be used for all components, e.g.:
Or we use different domains for the unified nomenclator and the taxonomic pieces.
Posting ideas as separate comments so we can count likes individually
Literature plays a crucial role especially in the nomenclator. This issue follows up on #22 and wants to discuss how to:
References primarily exist to allow users to trace back where information originally came from. The ideal situation would allow the actual text to be viewed so the information can be verified and/or additional information extracted.
I would even argue links to the actual text - whether this is a DOI to an (open access) journal, a plain URL to a webpage, a pdf or BHL pages - are the key here, not so much the reference metadata. The year of publication is very useful to present data in chronological order and important for the nomenclatural codes. But further parsing and even normalization of references does not seem to bring much benefit. Finding a DOI or BHL page probably requires some degree of parsing, but is that the only reason?
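The "some degree of parsing" needed to find a DOI inside a citation string can be quite shallow. A sketch, using a pragmatic regex heuristic rather than the full DOI grammar:

```python
import re

# DOIs start with "10.", a 4-9 digit registrant code, "/" and a suffix.
# This is a heuristic pattern, not the complete DOI specification.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s\"<>]+")

def extract_doi(reference: str):
    """Return the first DOI-like substring of a citation, or None."""
    match = DOI_PATTERN.search(reference)
    if not match:
        return None
    return match.group(0).rstrip(".,;)")   # trim trailing punctuation
```

If link discovery is this cheap, it supports the argument that full reference normalization may not pay for itself.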
I completely concur with the ideas re OTOL from @rdmpage. It seems to me that a vast majority of functionality (data modelling, tooling, etc.) could re-use OTOL's efforts, which I consider to be very well implemented as these things go. The underlying core data structures should be the same (graphs, studies). OTOL already has means to accession studies and graphs (= GSDs or individual databases and their classifications/checklists). OTOL has tools like the OTU mapper (potentially extractable - https://github.com/SpeciesFileGroup/otu_mapping_widget), which is a basic component of an Editorial (combining GSDs) interface, i.e. mapping known to unknown. Substitute branch lengths for NOMEN relationships (or something similar, I'm just using it as an example), and you have most of the expressiveness you need for the nomenclator (facts, not opinions). OTOL's collections are the obvious basis/proxy for opinions (taxon concepts).
How should the CoL deal with distribution data in the future?
Currently distribution data is:
Discuss whether we simply continue as is or:
Define rules for a stable taxonID, i.e. understand when a taxon changes sufficiently to warrant an identifier change.
Talking with others the idea came up to have the project proceed as follows:
Interested parties define API endpoints; we do this with RAML, or simply add an example request as an issue header and include the JSON/YAML response that is expected in the body. This puts the screws on folks like @rdmpage, @yroskov etc. to show exactly the types of requests they want.
The API endpoints are aggregated by @mdoering or other technical folk.
Everyone tags their favourite endpoints.
Sets of endpoints become proxies for "Apps" (like CoL plus).
Implementation begins. The API acts as unit tests (critical for this effort).
For each set of "favourited" endpoints we have a travis CI call that shows how many unit tests pass. When @yroskov and the CoL teams set of API endpoints pass (for example), then and only then, are GUIs, interfaces, etc. written.
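Since the project already uses RAML (see the API documentation issue above), an endpoint proposal in this workflow might look like the following fragment. The resource path, field names and example identifier are all hypothetical placeholders for discussion, not an agreed API:

```yaml
#%RAML 1.0
title: CoL+ Clearinghouse API (illustrative fragment)
version: v0
/name/{id}:
  get:
    description: Retrieve a single name record by its stable identifier.
    responses:
      200:
        body:
          application/json:
            example: |
              {
                "id": "urn:lsid:ipni.org:names:12345-1",
                "scientificName": "Aus bus",
                "authorship": "Linnaeus, 1758"
              }
```

A fragment like this is small enough to live in an issue body, yet concrete enough to be turned directly into a unit test for the implementation.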
A new taxonomic API should enable:
Evaluate what’s the best way forward to migrate the existing CoL portal to the new infrastructure.
Finding the balance between the least amount of resources needed to keep essential features of the current portal working on the new infrastructure, i.e. db model or webservices. Document key requirements for a new portal, including:
Based on that there are 2 main options to consider:
The current CoL uses a fixed set of ranks that limits its use and misses some important groups.
It is suggested to extend the ranks to cover the following, bold=existing CoL ranks:
kingdom
phylum
subphylum
class
subclass
order
suborder
superfamily
family
subfamily
tribe
genus
subgenus
It is not required to use all ranks in every group; tribes and subfamilies, for example, should be used as appears useful within the respective group.
See also the following issues in GBIF as a background:
http://dev.gbif.org/issues/browse/POR-2781
http://dev.gbif.org/issues/browse/POR-325
What is the policy for names in CoL when it comes to taxonomic remarks such as "non Author X", "sensu lato", "s.str.", "auct amer", "sensu X" or any other annotation found in names that indicates a taxon concept to some degree and is not a proper name in the strict nomenclatural sense?
For example there are these 2 names in the CoL:
Should these both exist or the synonym be filtered out?
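Whatever the policy turns out to be, detecting such annotated name strings is a prerequisite for filtering (and would also help with the chresonym-flagging problem raised earlier). A sketch, where the marker list is illustrative and would need curation per data source:

```python
import re

# Markers that signal a taxon-concept annotation rather than a plain name.
# Illustrative only; real sources use many more variants.
CONCEPT_MARKERS = re.compile(
    r"\b(sensu|s\.\s?l\.|s\.\s?str\.|auct\.?|non|nec)(\b|\s)",
    re.IGNORECASE,
)

def is_concept_annotated(name_string: str) -> bool:
    """True if the string carries a taxon-concept annotation."""
    return bool(CONCEPT_MARKERS.search(name_string))
```

A rule like this could flag candidates for review rather than delete them outright, since some matches will be false positives.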
We need something that organises the nomenclatural data if there is no taxonomic scrutiny available. The CoL hierarchy reaches down to families and, with the provisional addition of IRMNG, even to most genera. Try to be the users of our own product and organize names in the nomenclator provisionally using the latest CoL.