trias-project / daisie-checklist

🇪🇺 DAISIE - Inventory of alien species in Europe

Home Page: https://trias-project.github.io/daisie-checklist/

License: MIT License

CSS 100.00%

checklist dataset gbif invasive-species oscibio r rstats

daisie-checklist's Issues

Some taxa are duplicated

The following taxa are duplicated in the taxon core:
9741 ,9813 ,9828 ,10002, 10378 ,11143 ,11158 ,11191 ,11200 ,11646 ,12114 ,12116 ,12117
(due to the presence or absence of a value in the authorship field)

We could filter them out of the taxon core, but there could be information linked to these taxa in the extension files. If we remove them, people won't be able to look up the taxon information for those records. So for now I would keep them in the checklist and, if I have time, check whether these duplicated taxa re-occur in the extensions.
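
Checking whether the duplicated taxa re-occur in the extensions could be done with a small sketch like the one below. The IDs are taken from the list above; the `taxonID` column name and the example rows are assumptions for illustration, not the actual extension files.

```python
# IDs reported as duplicated in the taxon core (from the list above).
DUPLICATED_IDS = {9741, 9813, 9828, 10002, 10378, 11143, 11158,
                  11191, 11200, 11646, 12114, 12116, 12117}


def duplicated_ids_in_extension(rows, id_column="taxonID"):
    """Return the duplicated taxon IDs that an extension refers to.

    `rows` is an iterable of dicts (e.g. from csv.DictReader). The column
    name `taxonID` is an assumption about the extension layout.
    """
    referenced = {int(row[id_column]) for row in rows if row.get(id_column)}
    return sorted(DUPLICATED_IDS & referenced)


# Hypothetical extension rows, for illustration only.
example_distribution = [
    {"taxonID": "9741", "locality": "Belgium"},
    {"taxonID": "500", "locality": "France"},
    {"taxonID": "12117", "locality": "Spain"},
]
found = duplicated_ids_in_extension(example_distribution)
```

An empty result for every extension would mean the duplicates can be dropped from the core without losing linked information.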

Wrong phylum and order information

In line with #17, I discovered some inconsistencies in the phylum and order information:

Nematoda is a phylum, not a class
Bacteria is a kingdom, not a phylum

More inconsistencies will appear when scanning order or family information, but this is outside the scope of the mapping process.

taxonRank mapping

The information for taxonRank is contained in the field sp_subtaxon_rank.
taxonRank has a controlled vocabulary.
This is what I suggest for the mapping:

sp_subtaxon_rank	taxonRank
var.	variety
hyb.	[1]
subsp.	subspecies
agg.	[2]
subspecies	subspecies
x	[3]
Cytosporina sp.	[4]
var	variety
f.	form
f. sp.	[5]
sp.	species
Crous	[4]

[1] with respect to the mapping of hybrids, see this issue. I will integrate this accordingly.

[2] agg. stands for aggregate. I found these ranks for some genera and some species. I suspect that an aggregate for a species name is called speciesAggregate. However, I could not find an equivalent for genera. Could I just consider them as genera? @qgroom suggestions?

[4] errors?

[5] f. sp. = form species? There's no such vocabulary in GBIF, only form
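
The unambiguous part of the suggested mapping can be written as a lookup table, with everything else flagged for review. This is only a sketch: the function name and the review flag are assumptions, and the open cases ([1] to [5]) are deliberately left unmapped.

```python
# Mappings that are unambiguous in the table above.
RANK_MAP = {
    "var.": "variety",
    "var": "variety",
    "subsp.": "subspecies",
    "subspecies": "subspecies",
    "f.": "form",
    "sp.": "species",
}


def map_taxon_rank(sp_subtaxon_rank):
    """Return (taxonRank, needs_review) for a raw sp_subtaxon_rank value.

    Empty values are returned as (None, False); values without an agreed
    mapping (hyb., agg., x, f. sp., ...) are returned as (None, True).
    """
    key = (sp_subtaxon_rank or "").strip()
    if not key:
        return None, False
    if key in RANK_MAP:
        return RANK_MAP[key], False
    return None, True
```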

scientificNameAuthorship

The input_taxon file now contains two fields referring to an authorship:

sp_genus	sp_species	sp_authority	sp_subtaxon	sp_subtaxon_authority
Daucus	guttatus	Sibth. & Sm.	zahariadii	Heywood

I now integrated this information in the field scientificName:

Daucus guttatus Sibth. & Sm. zahariadii Heywood

This is a GBIF accepted taxon (tested using the GBIF lookup tool). But what is the scientificNameAuthorship here?
@qgroom

Multipart names in genus

The field genus contains:

Nematus (Pteronidea): Pteronidea should be in subgenus field
Acyrthosiphon Acyrthosiphon: Acyrthosiphon is probably a subgenus
Festuca arundinacea x Lolium multiflorum: genus should be empty
etc.

Different number of taxa in database <-> book

In the introduction of the DAISIE handbook, it is stated that the database includes 10857 species. However, in the taxon core, I count 12115 taxa (most of those are species).
@DavidRoy any explanations for this discrepancy?

metadata decisions required

For the following metadata fields, we need to decide the content.
@DavidRoy
@peterdesmet

For source datasets, we tend to use the following rules:

publisher = institutionCode = rightsHolder = the organization that had or was granted the permission to publish the data under the license it has. We keep this as close to the source organization as possible.

license: CC-BY
publisher: Centre for Ecology and Hydrology
institutionCode = CEH
rightsHolder: Centre for Ecology and Hydrology
datasetName= Inventory of alien species in Europe (DAISIE)

We could publish the checklist using the INBO IPT, but then we need to register CEH...

Duplicate higher classification names

Some names seem to appear twice, sometimes on multiple levels, e.g. phylum Annelida and class annelida, or class diplopoda and Diplopoda.

Map pathway and vector information to CBD standard

Due to time constraints, I will not combine vector and pathway information in the description extension. However, it would be nice to see whether both can be combined and mapped to the CBD standard

Cleanup scientific names

@DavidRoy,
I use the following fields to generate the scientific names of the species:
genus, species, authority, subtaxon and subtaxon_authority.

Some of these fields contain \N. This renders scientific names that are not correct. I guess I can remove them?
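
Removing the `\N` placeholders while building the name could be sketched as follows. The function name is hypothetical, and treating an empty string as missing alongside `\N` is an assumption.

```python
# Values treated as "missing" when assembling the name (assumption:
# empty strings are treated the same way as the \N placeholder).
NULL_MARKERS = {"", "\\N"}


def build_scientific_name(genus, species, authority, subtaxon, subtaxon_authority):
    """Join the name parts in order, skipping missing values."""
    parts = [genus, species, authority, subtaxon, subtaxon_authority]
    cleaned = [
        part.strip()
        for part in parts
        if part is not None and part.strip() not in NULL_MARKERS
    ]
    return " ".join(cleaned)
```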

Vespa velutina written as Ve velutina

Correct in raw/input_taxon.csv, incorrect in processed/taxon.csv

Standards for information in region field

Regional information is contained in the fields re_region_country and re_region_coast, with a reference to the following standards in the fields re_system_country and re_system_coast:

TDWG
IHO23_4
DIHO23_4
DAISIE

I found an overview of all recommended standards on GBIF. The TDWG and IHO standards are known by GBIF, so I can use these to map the data.
However:

What is the meaning of the numbers in IHO? (versions 2, 3 and 4? I only found version 2 and 3)
I could not find a DIHO standard
I suppose the DAISIE system is not really a standard?
All country information is written in full names. So I suppose I have to map them to the abbreviations myself.

@DavidRoy

Lowercase higher classification names

Class

acari
annelida
aranea
chilopoda
collembola
crustacea
diplopoda
insecta
...

Order

amphipoda
aranea
astigmata
...

Family

acrididae
adelgidae
aeolothripidae
...

Genus

Looks fine at first sight
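
Normalizing the case of class, order and family names is a small transformation. A minimal sketch (the function name is an assumption):

```python
def capitalize_taxon_name(name):
    """Return the name with an uppercase first letter and the rest lowercase."""
    name = name.strip()
    if not name:
        return name
    return name[:1].upper() + name[1:].lower()
```

After this step, values such as `annelida` and `Annelida` collapse into a single spelling, which also makes remaining duplicates easier to spot.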

Duplicated scientific names in taxon core

Several scientific names are duplicated in the taxon core:

dwc_scientificName	idspecies	ordo	family	genus
Cavia porcellus (Linnaeus 1758)	52831	Rodentia	Caviidae	Cavia
Cavia porcellus (Linnaeus 1758)	900966	NA	Caviidae	Cavia
Eliomys quercinus (Linnaeus 1766)	52845	Rodentia	Myoxidae	Eliomys
Eliomys quercinus (Linnaeus 1766)	900975	Rodentia	Gliridae	Eliomys
Erinaceus europaeus Linnaeus 1758	52847	Insectivora	Erinaceidae	Erinaceus
Erinaceus europaeus Linnaeus 1758	900977	Insectivora	Erinaceidae	Erinaceus
Hystrix cristata Linnaeus 1758	52858	Rodentia	Istricidae	Hystrix
Hystrix cristata Linnaeus 1758	900984	Cyperales	Istricidae	Hystrix
Mus musculus Linnaeus 1758	52877	Rodentia	Muridae	Mus
Mus musculus Linnaeus 1758	900995	Rodentia	Muridae	Mus
Mustela lutreola	52878	Carnivora	Mustelidae	Mustela
Mustela lutreola	900997	Carnivora	Mustelidae	Mustela
Pecari tajacu (Linnaeus 1758)	52889	Artiodactyla	Tayassuidae	Pecari
Pecari tajacu (Linnaeus 1758)	901002	NA	Tayassuidae	Pecari

Some of these differ in their classification. I compared those species with the GBIF backbone and suggest removing those that do not conform to the backbone taxonomy. I suggest keeping the following records in the taxon core:

dwc_scientificName	idspecies	ordo	family	genus
Cavia porcellus (Linnaeus 1758)	52831	Rodentia	Caviidae	Cavia
Eliomys quercinus (Linnaeus 1766)	900975	Rodentia	Gliridae	Eliomys
Erinaceus europaeus Linnaeus 1758	52847	Insectivora	Erinaceidae	Erinaceus
Hystrix cristata Linnaeus 1758	52858	Rodentia	Istricidae	Hystrix
Mus musculus Linnaeus 1758	52877	Rodentia	Muridae	Mus
Mustela lutreola	52878	Carnivora	Mustelidae	Mustela
Pecari tajacu (Linnaeus 1758)	52889	Artiodactyla	Tayassuidae	Pecari
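
Duplicates like these can be listed by grouping the taxon core on the scientific name. The sketch below uses the column names from the tables above and a few of their rows; deciding which record to keep still needs the manual comparison with the GBIF backbone.

```python
from collections import defaultdict


def find_duplicated_names(rows):
    """Map each scientific name that occurs more than once to its idspecies values."""
    ids_by_name = defaultdict(list)
    for row in rows:
        ids_by_name[row["dwc_scientificName"]].append(row["idspecies"])
    return {name: ids for name, ids in ids_by_name.items() if len(ids) > 1}


# A few rows from the table above.
rows = [
    {"dwc_scientificName": "Cavia porcellus (Linnaeus 1758)", "idspecies": 52831},
    {"dwc_scientificName": "Cavia porcellus (Linnaeus 1758)", "idspecies": 900966},
    {"dwc_scientificName": "Mus musculus Linnaeus 1758", "idspecies": 52877},
]
duplicates = find_duplicated_names(rows)
```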

Clean eventDate data

When inspecting date information, the following dates are odd:

1. Negative start_year for 174 records, starting from the year -7000 (!)
2. Negative end_year for 124 records, starting from the year -6000
3. For 29 distributions, the start year occurs after the end year (and not because I misinterpret the negative values :-) )

With respect to 1 and 2, I can hardly imagine these species being considered alien, as they were introduced so long ago. I would suggest leaving eventDate empty for the records in 1, 2 and 3. This only affects about 200 distributions (out of a total of 56000, i.e. about 0.35%).
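
The proposed rule could be sketched like this. The function name is hypothetical, the `start/end` year interval format is one possible way to write eventDate, and returning an empty value when a year is missing is an assumption.

```python
def to_event_date(start_year, end_year):
    """Return "start/end" as a year interval, or "" for the odd cases.

    Empty is returned when a year is negative or when the start year
    falls after the end year. Missing years also give "" (assumption).
    """
    if start_year is None or end_year is None:
        return ""
    if start_year < 0 or end_year < 0:
        return ""
    if start_year > end_year:
        return ""
    return f"{start_year:04d}/{end_year:04d}"
```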

compare taxonRank info with info from nameparser function

Information about the taxon rank can be provided in two ways:

By using the information in subtaxon_rank in the taxon core. This relates to the original information in DAISIE. This information should thus only apply for subtaxa
By using the information in rankmarker, provided by the GBIF nameparser

It would be interesting to see the differences between the returns of the GBIF nameparser and the content of the subtaxon_rank field

sources in taxon core

The taxon core contains 2 fields reserved for sources: bibliographicCitation and references (see here).

However, the only valid use of the term bibliographicCitation is to indicate the source of the taxon record (a field “source” is not available in the taxon core, in contrast with the distribution extension). references should be a URL to that record on a public website. However, some sources included in raw_taxon fall into neither category, e.g.:

Carpaneto G (1990) The Indian Grey Mangoose (Herpestes edwardsii) in the Circeo national Park:a case of incidental introduction. Mustelid and Viverrid Conservation, 2:10

Some are:

Davis PH et al (eds) (1965) Flora of Turkey and the East Aegean Islands, vol 1. University Press, Edinburgh

I'm tempted to exclude this information, but of course, then we lose information from the original dataset.
suggestions? @peterdesmet @DavidRoy

Incorrect Swedish vernacular name

Hi everyone,

I don't know if this is the place to report this type of issue but one of our users on GBIF noticed that the Swedish vernacular name for Centaurea jacea is incorrect.
He writes:

This is not "Blomsterlupin" in Swedish. Centaurea jacea is "rödklint" in Swedish. Blomserlupin is Lupinus polyphyllus.

See original GitHub issue: gbif/backbone-feedback#459
If this is not the place to report this type of issue, please let me know where I should log it. Thanks!!

Add kingdom information

Kingdom information is now excluded from the checklist.
When I have a spare moment, I will add this information in the R Markdown file.

Move distribution related information to occurrenceRemarks

Mentioned in #25: many of the descriptions are related to a specific distribution.

Rather than expressing this information in the description extension:

taxonid	description	type
100035	TDWG:CYP | IHO23_4:M3.1 - Established	degree of establishment
100035	TDWG:CYP | IHO23_4:M3.1 - Cape Greco to Cape Andreas	region of first record
100035	TDWG:CYP | IHO23_4:M3.1 - Dispersed	vector
100035	TDWG:CYP | IHO23_4:M3.1 - Red Sea	donor area
100035	TDWG:CYP | IHO23_4:M3.1 - Unknown	impact on ecology
100035	TDWG:CYP | IHO23_4:M3.1 - Known	impact on use
100035	TDWG:CYP | IHO23_4:M3.1 - Canals	pathway

I think it will be more readable to express these in context with the distribution in occurrenceRemarks:

taxonid	locationID	locality	...	occurrenceRemarks
100035	TDWG:CYP | IHO23_4:M3.1	Cyprus | Mediterranean Sea	...	population_status: established | region_of_first_record: Cape Greco to Cape Andreas | current_distribution: NA | vector: unintentional | donor_region: red sea | impact_on_ecology: unknown | impact_on_use: known | pathway: canals
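
Building such an occurrenceRemarks string from the individual fields could be sketched as follows. The function name is an assumption; the keys and the `NA` placeholder follow the example above.

```python
def build_occurrence_remarks(fields):
    """Join key/value pairs into one remarks string; empty values become NA."""
    parts = []
    for key, value in fields.items():
        text = str(value).strip() if value is not None else ""
        parts.append(f"{key}: {text or 'NA'}")
    return " | ".join(parts)


remarks = build_occurrence_remarks({
    "population_status": "established",
    "current_distribution": None,
    "pathway": "canals",
})
```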

clean scientificNames and taxonRank

The scientific names and taxon ranks need a thorough inspection.
A quick scan using the GBIF nameparser reveals that almost 1800 scientific names could not be parsed, or only partially, which gives an indication of nomenclatural issues. These include:

type	records
CANDIDATUS	1
CULTIVAR	16
DOUBTFUL	59
HYBRID	56
INFORMAL	40
SCIENTIFIC	1627

Especially the nomenclatural issues of the type = scientific should be inspected carefully.

I will publish the information now as it is integrated in the dataset.
Afterwards, we can take the time to inspect the data more in depth.

UTF-8 issue

I have the same problem here with the UTF-8 encoding of the taxon core.
The encoding of the raw data file is ok, but encoding problems appear when exporting it to a .csv and also when viewing it in R. I guess this is the same issue as this issue for the manual of alien plants. I cannot remember if we ever solved this one, although problems with the encoding did not occur for a long time.
I checked for the manual of alien plants, and the UTF-8 issue remains there as well, so I guess we didn't solve the problem yet.
@peterdesmet can you refresh my memory please?

Extension taxa not in core

Important: the following taxonIDs in distribution do not have a corresponding taxon (probably my fault):

Idem for vernacularname:

And description:
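
A check of this kind is essentially an anti-join between the extension and the core. A minimal sketch, with hypothetical IDs and function name:

```python
def orphan_taxon_ids(core_ids, extension_ids):
    """Return the taxonIDs used in an extension that are absent from the core."""
    return sorted(set(extension_ids) - set(core_ids))


# Hypothetical IDs, for illustration only.
orphans = orphan_taxon_ids(core_ids=[1, 2, 3], extension_ids=[2, 3, 4, 4])
```

The same function can be applied to the distribution, vernacularname and description extensions in turn.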

Clean up mapping for European part of Russia

For the regional information European part of Russia, the mapping to the TDWG standard could be nicer. Improve this when time is available

References in the distribution extension

The attribution of the sources to the records in the distribution extension differs from the approach used in the other dwc files. In the case of the distribution extension, a reference (in input_literature_references.csv) is attributed to a specific field_name (column name in the raw data file) x id_sp_region (the id of a taxon in a particular region) combination in input_distribution:

id_sp_region	field_name	reference
2365	ecoimpact_id	aaa et al. (xxx)
2365	first_introduction	bbb et al. (yyyy)

(example from input_literature_references)

The field_names as given in input_literature_references are:

current_distrib
current_distribution
distribution
ecoimpact_id
ecological impact
first observation
general references
history_is_known
impact on uses
introduction dates
introduction history
status
useimpact_id

The thing is, these field names do not correspond at all with the field names in input_distribution. On top of that, some field_names are very similar, such as current_distrib and current_distribution.

This is what I suggest:

Extract references for current_distrib, current_distribution, ecoimpact_id, ecological impact, first observation, impact on uses, status and useimpact_id, and link these references with the respective piece of information in the description extension (all these terms are mapped in the description).
Use distribution, general references, introduction dates and introduction history as references in the distribution extension, as they refer to fields included in this extension. This would be a concatenated string using | as a separator.
@DavidRoy
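
The second part of this suggestion could be sketched as below. The field names follow the list of field_names given above; the function name and the example references are assumptions, and history_is_known is left out because it is not assigned to either group.

```python
from collections import defaultdict

# field_names whose references would go to the distribution extension.
DISTRIBUTION_FIELDS = {
    "distribution",
    "general references",
    "introduction dates",
    "introduction history",
}


def distribution_references(rows):
    """Concatenate the distribution-related references per id_sp_region."""
    refs = defaultdict(list)
    for row in rows:
        if row["field_name"] not in DISTRIBUTION_FIELDS:
            continue
        region_refs = refs[row["id_sp_region"]]
        if row["reference"] not in region_refs:  # skip repeated references
            region_refs.append(row["reference"])
    return {key: " | ".join(values) for key, values in refs.items()}


# Placeholder rows in the style of the example above.
rows = [
    {"id_sp_region": 2365, "field_name": "ecoimpact_id", "reference": "aaa et al. (xxxx)"},
    {"id_sp_region": 2365, "field_name": "distribution", "reference": "bbb et al. (yyyy)"},
    {"id_sp_region": 2365, "field_name": "introduction dates", "reference": "ccc et al. (zzzz)"},
]
joined = distribution_references(rows)
```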

References without species id

In input_references, about 8000 references have no link with an idspecies or id_sp_region. I find that a bit odd. These references will never be used in the other extensions, as each row contains an idspecies field (which I use to join two tables together).
I prefer to throw these references out. In this way, the literature references will include sources that always link to a certain taxon. @DavidRoy?
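
Dropping the unlinked references could be sketched as follows. How a missing ID is encoded in input_references is not stated here, so the set of "empty" values below is an assumption, as are the function names.

```python
# Values treated as a missing ID (assumption about the raw encoding).
EMPTY_VALUES = {"", "\\N", "NA"}


def is_linked(row):
    """True if the reference has an idspecies or an id_sp_region."""
    return any(
        (row.get(column) or "").strip() not in EMPTY_VALUES
        for column in ("idspecies", "id_sp_region")
    )


def keep_linked_references(rows):
    """Keep only the references that link to a taxon or a taxon x region."""
    return [row for row in rows if is_linked(row)]
```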

Mapping of abundance

The DwC field occurrenceStatus will be, in certain cases, a combination between the fields abundance and population_status.

In most cases, we can translate abundance to the GBIF controlled vocabulary of occurrenceStatus. This is what I suggest (not all mappings are straightforward):

abundance	occurrenceStatus
common	common
abundant	common
rare	rare
local	present
single record	present
sporadic	irregular
unknown	doubtful

However, for 30 taxa, population_status contains the value Extinct, which is valuable information for occurrenceStatus:

population_status	abundance	records
Extinct	Absent or extinct	9
Extinct	Local	6
Extinct	Rare	11
Extinct	Single record	4
Extinct	Unknown	8

I would suggest the following:

population_status	abundance	occurrenceStatus
Extinct	Absent or extinct	extinct
Extinct	Unknown	extinct

For the remaining abundance data (21 records in total) we could split the data (this is the same approach as in the manual of alien plants):

population_status	abundance	occurrenceStatus	eventDate
Extinct	local	present	start_year/end_year
Extinct	rare	rare	start_year/end_year
Extinct	single record	present	start_year/end_year
Extinct	local	extinct	end_year/now
Extinct	rare	extinct	end_year/now
Extinct	single record	extinct	end_year/now

However, as it only concerns 21 records, I would keep it simple and just map these abundance categories to extinct.
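
The simple option (every record with population_status Extinct becomes extinct, all others follow the abundance table) could be sketched like this. The function name is an assumption; unmapped abundance values return `None` rather than a guessed status.

```python
# Abundance to occurrenceStatus, as suggested in the table above.
ABUNDANCE_TO_STATUS = {
    "common": "common",
    "abundant": "common",
    "rare": "rare",
    "local": "present",
    "single record": "present",
    "sporadic": "irregular",
    "unknown": "doubtful",
}


def occurrence_status(abundance, population_status):
    """Return occurrenceStatus; Extinct in population_status takes priority."""
    if (population_status or "").strip().lower() == "extinct":
        return "extinct"
    return ABUNDANCE_TO_STATUS.get((abundance or "").strip().lower())
```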

establishmentMeans mapping

Information for establishmentMeans is contained in ss_species_status.
This is what I suggest for the mapping to the establishmentMeans vocabulary:

ss_species_status	establishmentMeans
Alien	introduced
Cryptogenic	uncertain
Naturalized	naturalised
Casual	naturalised
Alien_invasive	invasive

I chose to map casual to naturalised, based on the argument by @timadriaens here
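
The suggested mapping as a lookup, in a minimal sketch (the function name is an assumption; values outside the table return `None`):

```python
# ss_species_status to establishmentMeans, as suggested above.
ESTABLISHMENT_MEANS = {
    "Alien": "introduced",
    "Cryptogenic": "uncertain",
    "Naturalized": "naturalised",
    "Casual": "naturalised",
    "Alien_invasive": "invasive",
}


def map_establishment_means(ss_species_status):
    """Return the establishmentMeans value, or None if there is no mapping."""
    return ESTABLISHMENT_MEANS.get((ss_species_status or "").strip())
```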

How to publish DAISIE

The background information of DAISIE is published in the book "Handbook of Alien Species in Europe". In this book, each taxonomical group is published as a different chapter, with its own authorship:

Alien fungi
Alien Bryophytes and Lichens
Alien vascular plants
Alien Terrestrial invertebrates
Alien invertebrates and fish
Alien marine biota
Alien birds, amphibians and reptiles
Alien mammals

For the publication of DAISIE, I see two options:

1. Publish a subset of DAISIE for each taxonomical group
Advantage: it would be easier for researchers to scan for their specific species group
Advantage: it would greatly reduce the list of authors (only max. of 6 authors)
Advantage: each checklist would be published with its own metadata (from the book)
Advantage: each checklist could be linked under the project information "DAISIE" (as we do in TrIAS)
Disadvantage: This implies that 9 (!) DAISIE checklists will be published
Disadvantage: much more work writing the metadata
2. Publish the checklist as one large checklist on GBIF.
Advantage: evident, only one DAISIE checklist on GBIF, which is our usual approach. Much easier to import the data and filter on the data as a whole
Disadvantage: a very large group of authors.

I discussed this with @DavidRoy and for him it's not an option to only include the first author of each taxonomical group, and to include the rest in the metadata (if we would publish the dataset as a whole)

I prefer option 2, and just keep the checklist as a whole, which is what we always try to do. Of course I understand the advantages of splitting up the checklist.

@qgroom, @peterdesmet , @timadriaens @SoVDH ?