trias-project / daisie-checklist
🇪🇺 DAISIE - Inventory of alien species in Europe
Home Page: https://trias-project.github.io/daisie-checklist/
License: MIT License
The following taxa are duplicated in the taxon core:
9741, 9813, 9828, 10002, 10378, 11143, 11158, 11191, 11200, 11646, 12114, 12116, 12117 (due to the presence or absence of information in the authorship field)
We could filter them out of the taxon core, but the problem is that there could be information linked to these taxa in other extension files. If we remove them, people won't be able to search for taxon information. So for now I would keep them in the checklist, and if I have time, I could check whether or not these duplicated taxa re-occur in the extensions
In line with #17, I discovered some inconsistencies in the phylum and order information:
- `Bacteria` is a kingdom, not a phylum

More inconsistencies will appear when scanning order or family information, but this is outside the scope of the mapping process.
The information for `taxonRank` is contained in the field `sp_subtaxon_rank`.
`taxonRank` has a controlled vocabulary.
This is what I suggest for the mapping:
sp_subtaxon_rank | taxonRank |
---|---|
var. | variety |
hyb. | [1] |
subsp. | subspecies |
agg. | [2] |
subspecies | subspecies |
x | [3] |
Cytosporina sp. | [4] |
var | variety |
f. | form |
f. sp. | [5] |
sp. | species |
Crous | [4] |
[1] with respect to the mapping of hybrids, see this issue. I will integrate this accordingly.
[2] agg. stands for aggregate. I found these ranks for some genera and some species. I suspect that an aggregate for a species name is called `speciesAggregate`. However, I could not find an equivalent for genera. Could I just consider them as genera? @qgroom suggestions?
[4] errors?
[5] f. sp. = form species? There's no such vocabulary in GBIF, only form
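The suggested mapping can be sketched as a lookup table. This is only an illustration of the table above, not the code used in the repository: the names `RANK_MAPPING` and `map_taxon_rank` are hypothetical, and the values still under discussion in the footnotes are deliberately left as `None`.

```python
# Hypothetical lookup table built from the suggested mapping above.
# None marks source values whose target rank is still under discussion.
RANK_MAPPING = {
    "var.": "variety",
    "var": "variety",
    "subsp.": "subspecies",
    "subspecies": "subspecies",
    "f.": "form",
    "sp.": "species",
    "hyb.": None,             # footnote [1]
    "agg.": None,             # footnote [2]
    "x": None,                # footnote [3]
    "Cytosporina sp.": None,  # footnote [4]
    "Crous": None,            # footnote [4]
    "f. sp.": None,           # footnote [5]
}


def map_taxon_rank(sp_subtaxon_rank):
    """Return the suggested taxonRank, or None when undecided or unknown."""
    if sp_subtaxon_rank is None:
        return None
    return RANK_MAPPING.get(sp_subtaxon_rank.strip())
```

Returning `None` for undecided values keeps them visible for review instead of silently assigning a rank.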
The `input_taxon` file now contains two fields referring to an authorship:
sp_genus | sp_species | sp_authority | sp_subtaxon | sp_subtaxon_authority |
---|---|---|---|---|
Daucus | guttatus | Sibth. & Sm. | zahariadii | Heywood |
I now integrated this information in the field `scientificName`:
Daucus guttatus Sibth. & Sm. zahariadii Heywood
This is a GBIF accepted taxon (tested using the GBIF lookup tool). But what is the `scientificNameAuthorship` here?
@qgroom
The field genus contains:
Nematus (Pteronidea): Pteronidea should be in subgenus field
Acyrthosiphon Acyrthosiphon: Acyrthosiphon is probably a subgenus
Festuca arundinacea x Lolium multiflorum: genus should be empty
etc.
In the introduction of the DAISIE handbook, it is stated that the database includes 10857 species. However, in the taxon core, I count 12115 taxa (most of those are species).
@DavidRoy any explanations for this discrepancy?
For the following metadata fields, we need to decide the content.
@DavidRoy
@peterdesmet
For source datasets, we tend to use the following rules:
publisher = institutionCode = rightsHolder = the organization that had or was granted the permission to publish the data under the license it has. We keep this as close to the source organization as possible.
We could publish the checklist using the INBO IPT, but then we need to register CEH...
Some names seem to appear twice and on multiple levels, e.g. phylum `Annelida` and class `annelida`, or class `diplopoda` and `Diplopoda`
Due to time constraints, I will not combine vector and pathway information in the description extension. However, it would be nice to see whether both can be combined and mapped to the CBD standard
@DavidRoy,
I use the following fields to generate the scientific names of the species: `genus`, `species`, `authority`, `subtaxon` and `subtaxon_authority`.
Some of these fields contain `\N`. This renders scientific names that are not correct. I guess I can remove them?
Correct in `raw/input_taxon.csv`, incorrect in `processed/taxon.csv`
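A minimal sketch of how the name could be assembled while dropping the `\N` markers. The function name `build_scientific_name` and the set `NULL_MARKERS` are hypothetical, and treating empty strings like `\N` is an assumption rather than something the raw data confirms.

```python
# Assumption: "\N" (as found in the raw file) and empty strings both mean "no value".
NULL_MARKERS = {"", "\\N"}


def build_scientific_name(genus, species, authority, subtaxon, subtaxon_authority):
    """Join the name parts with single spaces, skipping missing values."""
    parts = [genus, species, authority, subtaxon, subtaxon_authority]
    kept = [
        part.strip()
        for part in parts
        if part is not None and part.strip() not in NULL_MARKERS
    ]
    return " ".join(kept)


# Example with the record discussed earlier in this thread.
print(build_scientific_name("Daucus", "guttatus", "Sibth. & Sm.", "zahariadii", "Heywood"))
```

Filtering before joining avoids both literal `\N` fragments and double spaces in the resulting name.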
Regional information is contained in the fields `re_region_country` and `re_region_coast`, with a reference to the following standards in the fields `re_system_country` and `re_system_coast`:
I found an overview of all recommended standards on GBIF. The TDWG and IHO standards are known by GBIF, so I can use these to map the data.
However:
acari
annelida
aranea
chilopoda
collembola
crustacea
diplopoda
insecta
...
amphipoda
aranea
astigmata
...
acrididae
adelgidae
aeolothripidae
...
Looks fine at first sight
Several scientific names are duplicated in the taxon core:
dwc_scientificName | idspecies | ordo | family | genus |
---|---|---|---|---|
Cavia porcellus (Linnaeus 1758) | 52831 | Rodentia | Caviidae | Cavia |
Cavia porcellus (Linnaeus 1758) | 900966 | NA | Caviidae | Cavia |
Eliomys quercinus (Linnaeus 1766) | 52845 | Rodentia | Myoxidae | Eliomys |
Eliomys quercinus (Linnaeus 1766) | 900975 | Rodentia | Gliridae | Eliomys |
Erinaceus europaeus Linnaeus 1758 | 52847 | Insectivora | Erinaceidae | Erinaceus |
Erinaceus europaeus Linnaeus 1758 | 900977 | Insectivora | Erinaceidae | Erinaceus |
Hystrix cristata Linnaeus 1758 | 52858 | Rodentia | Istricidae | Hystrix |
Hystrix cristata Linnaeus 1758 | 900984 | Cyperales | Istricidae | Hystrix |
Mus musculus Linnaeus 1758 | 52877 | Rodentia | Muridae | Mus |
Mus musculus Linnaeus 1758 | 900995 | Rodentia | Muridae | Mus |
Mustela lutreola | 52878 | Carnivora | Mustelidae | Mustela |
Mustela lutreola | 900997 | Carnivora | Mustelidae | Mustela |
Pecari tajacu (Linnaeus 1758) | 52889 | Artiodactyla | Tayassuidae | Pecari |
Pecari tajacu (Linnaeus 1758) | 901002 | NA | Tayassuidae | Pecari |
Some of these differ in their classification. I compared those species with the GBIF backbone, and suggest removing those that do not conform to the backbone taxonomy. I suggest keeping the following records in the taxon core:
dwc_scientificName | idspecies | ordo | family | genus |
---|---|---|---|---|
Cavia porcellus (Linnaeus 1758) | 52831 | Rodentia | Caviidae | Cavia |
Eliomys quercinus (Linnaeus 1766) | 900975 | Rodentia | Gliridae | Eliomys |
Erinaceus europaeus Linnaeus 1758 | 52847 | Insectivora | Erinaceidae | Erinaceus |
Hystrix cristata Linnaeus 1758 | 52858 | Rodentia | Istricidae | Hystrix |
Mus musculus Linnaeus 1758 | 52877 | Rodentia | Muridae | Mus |
Mustela lutreola | 52878 | Carnivora | Mustelidae | Mustela |
Pecari tajacu (Linnaeus 1758) | 52889 | Artiodactyla | Tayassuidae | Pecari |
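Duplicates like the ones listed above can be found by grouping identifiers per scientific name. The sketch below assumes the taxon core is available as a list of dictionaries with the column names used in the tables. The function name `find_duplicate_names` is hypothetical.

```python
from collections import defaultdict


def find_duplicate_names(records):
    """Map each scientific name that occurs more than once to its idspecies values."""
    ids_by_name = defaultdict(list)
    for record in records:
        ids_by_name[record["dwc_scientificName"]].append(record["idspecies"])
    return {name: ids for name, ids in ids_by_name.items() if len(ids) > 1}


# A few rows taken from the first table above.
rows = [
    {"dwc_scientificName": "Cavia porcellus (Linnaeus 1758)", "idspecies": 52831},
    {"dwc_scientificName": "Cavia porcellus (Linnaeus 1758)", "idspecies": 900966},
    {"dwc_scientificName": "Mus musculus Linnaeus 1758", "idspecies": 52877},
    {"dwc_scientificName": "Mus musculus Linnaeus 1758", "idspecies": 900995},
]
print(find_duplicate_names(rows))
```

Deciding which record of each pair to keep still requires the manual comparison with the GBIF backbone described above.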
When inspecting date information, the following dates are odd:
1. `start_year` for 174 records, starting from the year -7000 (!)
2. `end_year` for 124 records, starting from the year -6000

With respect to 1 and 2, I can hardly imagine these species to be alien, as they were introduced many, many years ago. I would suggest leaving eventDate information empty for the records in 1, 2 and 3. This only affects about 200 distributions (out of a total of 56000 distributions = 0.35%)
Information about the taxon rank can be provided in two ways:
- `subtaxon_rank` in the taxon core. This relates to the original information in DAISIE. This information should thus only apply to subtaxa
- `rankmarker`, provided by the GBIF nameparser

It would be interesting to see the differences between the output of the GBIF nameparser and the content of the `subtaxon_rank` field
The taxon core contains 2 fields reserved for sources: `bibliographicCitation` and `references` (see here).
However, the only valid use of the term `bibliographicCitation` is to indicate the source of the taxon record (a field “source” is not available in the taxon core, in contrast with the distribution extension). `references` should be a URL to that record on a public website. However, some sources included in raw_taxon fall into neither category, e.g.:
Carpaneto G (1990) The Indian Grey Mangoose (Herpestes edwardsii) in the Circeo national Park:a case of incidental introduction. Mustelid and Viverrid Conservation, 2:10
Some are:
Davis PH et al (eds) (1965) Flora of Turkey and the East Aegean Islands, vol 1. University Press, Edinburgh
I'm tempted to exclude this information, but of course, then we lose information from the original dataset.
suggestions? @peterdesmet @DavidRoy
Hi everyone,
I don't know if this is the place to report this type of issue but one of our users on GBIF noticed that the Swedish vernacular name for Centaurea jacea is incorrect.
He writes:
This is not "Blomsterlupin" in Swedish. Centaurea jacea is "rödklint" in Swedish. Blomserlupin is Lupinus polyphyllus.
See original GitHub issue: gbif/backbone-feedback#459
If this is not the place to report this type of issue, please let me know where I should log it. Thanks!!
Kingdom information is now excluded from the checklist.
When I have a spare moment, I will add the information in the Rmarkdown file
Mentioned in #25: many of the descriptions are related to a specific distribution.
Rather than expressing this information in the description extension:
taxonid | description | type |
---|---|---|
100035 | TDWG:CYP \| IHO23_4:M3.1 - Established | degree of establishment |
100035 | TDWG:CYP \| IHO23_4:M3.1 - Cape Greco to Cape Andreas | region of first record |
100035 | TDWG:CYP \| IHO23_4:M3.1 - Dispersed | vector |
100035 | TDWG:CYP \| IHO23_4:M3.1 - Red Sea | donor area |
100035 | TDWG:CYP \| IHO23_4:M3.1 - Unknown | impact on ecology |
100035 | TDWG:CYP \| IHO23_4:M3.1 - Known | impact on use |
100035 | TDWG:CYP \| IHO23_4:M3.1 - Canals | pathway |
I think it will be more readable to express these in context with the distribution in occurrenceRemarks:
taxonid | locationID | locality | ... | occurrenceRemarks |
---|---|---|---|---|
100035 | TDWG:CYP \| IHO23_4:M3.1 | Cyprus \| Mediterranean Sea | ... | population_status: established \| region_of_first_record: Cape Greco to Cape Andreas \| current_distribution: NA \| vector: unintentional \| donor_region: red sea \| impact_on_ecology: unknown \| impact_on_use: known \| pathway: canals |
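The `occurrenceRemarks` string in the example row can be sketched as a join of `key: value` pairs. The function name `build_occurrence_remarks` is hypothetical, and writing missing values as `NA` is an assumption taken from the `current_distribution: NA` entry in the example.

```python
def build_occurrence_remarks(fields):
    """Concatenate key/value pairs as 'key: value' separated by ' | '."""
    parts = []
    for key, value in fields.items():
        # Assumption: missing values are written as "NA", as in the example row.
        text = "NA" if value is None or value == "" else str(value)
        parts.append(f"{key}: {text}")
    return " | ".join(parts)


remarks = build_occurrence_remarks({
    "population_status": "established",
    "region_of_first_record": "Cape Greco to Cape Andreas",
    "current_distribution": None,
    "vector": "unintentional",
    "donor_region": "red sea",
    "impact_on_ecology": "unknown",
    "impact_on_use": "known",
    "pathway": "canals",
})
print(remarks)
```

Keeping a fixed key order makes the remarks comparable across distribution records.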
A thorough inspection is required for the scientific names and taxon ranks.
A quick scan using the GBIF nameparser reveals that almost 1800 scientific names could not be parsed or were only partially parsed, which gives an indication of nomenclatural issues. These include:
type | records |
---|---|
CANDIDATUS | 1 |
CULTIVAR | 16 |
DOUBTFUL | 59 |
HYBRID | 56 |
INFORMAL | 40 |
SCIENTIFIC | 1627 |
Especially the nomenclatural issues of the type = scientific should be inspected carefully.
I will publish the information now as it is integrated in the dataset.
Afterwards, we can take the time to inspect the data more in depth.
I have the same problem here with the UTF-8 encoding of the taxon core.
The encoding of the raw data file is ok, but when exporting it to a .csv and also when viewing it in R, some encoding problems appear. I guess this is the same issue as this issue for the manual of alien plants. I cannot remember if we ever solved this one, although problems with the encoding have not occurred for a long time.
I checked for the manual of alien plants, and the UTF-8 issue remains there as well, so I guess we didn't solve the problem yet.
@peterdesmet can you refresh my memory, please?
Important: the following `taxonID`s in distribution do not have a corresponding taxon (probably my fault):
50244 1
54403 1
54563 1
90231 1
105323 50
105356 49
Idem for vernacularname:
44 1
1201 3
5992 1
6978 16
8522 1
11921 4
17544 4
17906 16
19540 2
22885 8
23736 1
50244 1
50828 1
53392 1
53517 1
53523 1
53524 1
53530 1
53578 1
And description:
10569 1
22885 11
50244 6
50286 3
53282 1
53385 1
53419 1
53571 3
53578 1
53602 1
54331 1
54642 2
90231 1
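Lists like the ones above (a `taxonID` followed by what is presumably the number of affected rows) can be produced with a simple anti-join between an extension and the taxon core. The function name `count_orphan_taxon_ids` is hypothetical, and both inputs are assumed to be plain sequences of identifiers.

```python
from collections import Counter


def count_orphan_taxon_ids(core_ids, extension_ids):
    """Count extension rows per taxonID that has no match in the taxon core."""
    core = set(core_ids)
    return Counter(tid for tid in extension_ids if tid not in core)


# Illustrative identifiers only, not taken from the dataset.
orphans = count_orphan_taxon_ids(core_ids=[1, 2], extension_ids=[1, 3, 3, 4])
for taxon_id, n_rows in sorted(orphans.items()):
    print(taxon_id, n_rows)
```

Running the same check for each extension file gives one list per extension, as reported above.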
For the regional information `European part of Russia`, the mapping to the TDWG standard could be nicer. Improve this when time is available
The attribution of the sources to the records in the distribution extension differs from the approach used in the other dwc files. In the case of the distribution extension, a reference (in `input_literature_references.csv`) is attributed to a specific `field_name` (column name in the raw data file) x `id_sp_region` (the id of a taxon in a particular region) combination in `input_distribution`:
id_sp_region | field_name | reference |
---|---|---|
2365 | ecoimpact_id | aaa et al. (xxx) |
2365 | first_introduction | bbb et al. (yyyy) |
(example from `input_literature_references`)
The field_names as given in `input_literature_references` are:
current_distrib
current_distribution
distribution
ecoimpact_id
ecological impact
first observation
general references
history_is_known
impact on uses
introduction dates
introduction history
status
useimpact_id
The thing is, these field names do not correspond at all with the field names in `input_distribution`. On top of that, some field_names are very similar, such as `current_distrib` and `current_distribution`.
This is what I suggest:
- Take `current_distrib`, `current_distribution`, `ecoimpact_id`, `ecological impact`, `first_observation`, `impact on uses`, `status` and `useimpact_id`, and link these references with the respective piece of information in the description extension (all these terms are mapped in the description).
- Use `distribution`, `general_references`, `introduction dates` and `introduction_history` as references in the distribution extension, as they refer to fields included in this extension. This would be a concatenated string using `|` as a separator.

In `input_references`, about 8000 references have no link with an `idspecies` or `id_sp_region`. I find that a bit odd. These references will never be used in the other extensions, as each row contains an `idspecies` field (which I use to join two tables together).
I prefer to throw these references out. In this way, the literature references will include sources that always link to a certain taxon. @DavidRoy?
The DwC field `occurrenceStatus` will be, in certain cases, a combination of the fields `abundance` and `population_status`.
In most cases, we can translate `abundance` to the GBIF controlled vocabulary of `occurrenceStatus`. This is what I suggest (not all mappings are straightforward):
abundance | occurrenceStatus |
---|---|
common | common |
abundant | common |
rare | rare |
local | present |
single record | present |
sporadic | irregular |
unknown | doubtful |
However, for 30 taxa, `population_status` contains the value `extinct`, which is valuable information for `occurrenceStatus`:
population_status | abundance | records |
---|---|---|
Extinct | Absent or extinct | 9 |
Extinct | Local | 6 |
Extinct | Rare | 11 |
Extinct | Single record | 4 |
Extinct | Unknown | 8 |
I would suggest the following:
population_status | abundance | occurrenceStatus |
---|---|---|
Extinct | Absent or extinct | extinct |
Extinct | Unknown | extinct |
For the remaining abundance data (21 records in total) we could split the data (this is the same approach as in the manual of alien plants):
population_status | abundance | occurrenceStatus | eventDate |
---|---|---|---|
Extinct | local | present | start_year/end_year |
Extinct | rare | rare | start_year/end_year |
Extinct | single record | present | start_year/end_year |
Extinct | local | extinct | end_year/now |
Extinct | rare | extinct | end_year/now |
Extinct | single record | extinct | end_year/now |
However, as it only concerns 21 records, I would keep it simple, and just map the abundance categories to extinct
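The simple variant (any record with `population_status` extinct becomes `extinct`, otherwise `abundance` is translated) can be sketched as follows. The names `ABUNDANCE_TO_STATUS` and `map_occurrence_status` are hypothetical, and the case-insensitive matching is an assumption, made because the tables above mix `Rare` and `rare`.

```python
# Suggested abundance -> occurrenceStatus mapping from the first table above.
ABUNDANCE_TO_STATUS = {
    "common": "common",
    "abundant": "common",
    "rare": "rare",
    "local": "present",
    "single record": "present",
    "sporadic": "irregular",
    "unknown": "doubtful",
}


def map_occurrence_status(abundance, population_status):
    """Return occurrenceStatus; an extinct population overrides abundance."""
    if population_status is not None and population_status.strip().lower() == "extinct":
        return "extinct"
    if abundance is None:
        return None
    # Unmapped abundance values return None so they can be reviewed manually.
    return ABUNDANCE_TO_STATUS.get(abundance.strip().lower())
```

Checking `population_status` first means the 21 records with both an abundance category and an extinct status end up as `extinct`, which matches the simple approach preferred above.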
Information for `establishmentMeans` is contained in `ss_species_status`.
This is what I suggest for the mapping to the establishmentMeans vocabulary:
ss_species_status | establishmentMeans |
---|---|
Alien | introduced |
Cryptogenic | uncertain |
Naturalized | naturalised |
Casual | naturalised |
Alien_invasive | invasive |
I chose to map `casual` to `naturalised`, based on the argument by @timadriaens here
The background information of DAISIE is published in the book "Handbook of Alien Species in Europe". In this book, each taxonomical group is published as a different chapter, with its own authorship:
For the publication of DAISIE, I have two options:
1. Publish a subset of DAISIE for each taxonomical group
   - Advantage: it would be easier for researchers to scan for their specific species group
   - Advantage: it would greatly reduce the list of authors (a maximum of 6 authors)
   - Advantage: each checklist would be published with its own metadata (from the book)
   - Advantage: each checklist could be linked under the project information "DAISIE" (as we do in TrIAS)
   - Disadvantage: this implies that 9 (!) DAISIE checklists will be published
   - Disadvantage: much more work writing the metadata
2. Publish the checklist as one large checklist on GBIF.
   - Advantage: evident, only one DAISIE checklist on GBIF, which is our usual approach. Much easier to import the data and filter on the data as a whole
   - Disadvantage: a very large group of authors.
I discussed this with @DavidRoy and for him it's not an option to only include the first author of each taxonomical group, and to include the rest in the metadata (if we would publish the dataset as a whole)
I prefer option 2, and just keep the checklist as a whole, which is what we always try to do. Of course I understand the advantages of splitting up the checklist.