
daisie-checklist's People

Contributors

lienreyserhove, peterdesmet, stijnvanhoey


daisie-checklist's Issues

Some taxa are duplicated

The following taxa are duplicated in the taxon core:
9741, 9813, 9828, 10002, 10378, 11143, 11158, 11191, 11200, 11646, 12114, 12116, 12117
(due to the presence or absence of a value in the authorship field)

We could filter them out of the taxon core, but information in the extension files may be linked to these taxa. If we remove them, people won't be able to look up the taxon information for those records. So for now I would keep them in the checklist and, if I have time, check whether these duplicated taxa also occur in the extensions.

Wrong phylum and order information

In line with #17, I discovered some inconsistencies in the phylum and order information:

  • Nematoda is a phylum, not a class
  • Bacteria is a kingdom, not a phylum

More inconsistencies will appear when scanning order or family information, but this is outside the scope of the mapping process.

taxonRank mapping

The information for taxonRank is contained in the field sp_subtaxon_rank.
taxonRank has a controlled vocabulary.
This is what I suggest for the mapping:

| sp_subtaxon_rank | taxonRank |
| --- | --- |
| var. | variety |
| hyb. | [1] |
| subsp. | subspecies |
| agg. | [2] |
| subspecies | subspecies |
| x | [3] |
| Cytosporina sp. | [4] |
| var | variety |
| f. | form |
| f. sp. | [5] |
| sp. | species |
| Crous | [4] |

[1] with respect to the mapping of hybrids, see this issue. I will integrate this accordingly.

[2] agg. stands for aggregate. I found these ranks for some genera and some species. I suspect that an aggregate for a species name is called speciesAggregate. However, I could not find an equivalent for genera. Could I just consider them as genera? @qgroom suggestions?

[4] errors?

[5] f. sp. = form species? There's no such vocabulary in GBIF, only form

scientificNameAuthorship

The input_taxon file now contains two fields referring to an authorship:

| sp_genus | sp_species | sp_authority | sp_subtaxon | sp_subtaxon_authority |
| --- | --- | --- | --- | --- |
| Daucus | guttatus | Sibth. & Sm. | zahariadii | Heywood |

I now integrated this information in the field scientificName:

Daucus guttatus Sibth. & Sm. zahariadii Heywood

This is a GBIF accepted taxon (tested using the GBIF lookup tool). But what is the scientificNameAuthorship here?
@qgroom

Multipart names in genus

The field genus contains:

Nematus (Pteronidea): Pteronidea should be in subgenus field
Acyrthosiphon Acyrthosiphon: Acyrthosiphon is probably a subgenus
Festuca arundinacea x Lolium multiflorum: genus should be empty
etc.

Different number of taxa in database <-> book

In the introduction of the DAISIE handbook, it is stated that the database includes 10857 species. However, in the taxon core, I count 12115 taxa (most of those are species).
@DavidRoy any explanations for this discrepancy?

metadata decisions required

For the following metadata fields, we need to decide the content.
@DavidRoy
@peterdesmet

For source datasets, we tend to use the following rules:

publisher = institutionCode = rightsHolder = the organization that had, or was granted, permission to publish the data under its license. We keep this as close to the source organization as possible.

  • license: CC-BY
  • publisher: Centre for Ecology and Hydrology
  • institutionCode: CEH
  • rightsHolder: Centre for Ecology and Hydrology
  • datasetName: Inventory of alien species in Europe (DAISIE)

We could publish the checklist using the INBO IPT, but then we need to register CEH...

Cleanup scientific names

@DavidRoy,
I use the following fields to generate the scientific names of the species:
genus, species, authority, subtaxon and subtaxon_authority.

Some of these fields contain \N. This renders scientific names that are not correct. I guess I can remove them?
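
To make the cleanup concrete, here is a minimal Python sketch of how the name could be assembled while skipping `\N` placeholders. The function name `build_scientific_name` is hypothetical, and treating empty strings the same way as `\N` is an assumption, not something confirmed for the raw data.

```python
# Placeholders treated as "no value". Only \N is mentioned in the issue;
# the empty string is an added assumption.
NULL_MARKERS = {"", r"\N"}


def build_scientific_name(genus, species, authority, subtaxon, subtaxon_authority):
    """Join the name parts in order, dropping parts that are null placeholders."""
    parts = [genus, species, authority, subtaxon, subtaxon_authority]
    cleaned = [
        part.strip()
        for part in parts
        if part is not None and part.strip() not in NULL_MARKERS
    ]
    return " ".join(cleaned)


print(build_scientific_name("Daucus", "guttatus", "Sibth. & Sm.", "zahariadii", "Heywood"))
print(build_scientific_name("Mustela", "lutreola", r"\N", r"\N", r"\N"))
```

Filtering before joining avoids both the literal `\N` in the name and the double spaces that a simple find-and-replace afterwards would leave behind.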

Standards for information in region field

Regional information is contained in the fields re_region_country and re_region_coast, with a reference to the following standards in the fields re_system_country and re_system_coast:

  • TDWG
  • IHO23_4
  • DIHO23_4
  • DAISIE

I found an overview of all recommended standards on GBIF. The TDWG and IHO standards are known by GBIF, so I can use these to map the data.
However:

  1. What is the meaning of the numbers in IHO? (versions 2, 3 and 4? I only found version 2 and 3)
  2. I could not find a DIHO standard
  3. I suppose the DAISIE system is not really a standard?
  4. All country information is written in full names. So I suppose I have to map them to the abbreviations myself.

@DavidRoy

Lowercase higher classification names

Class

acari
annelida
aranea
chilopoda
collembola
crustacea
diplopoda
insecta
...

Order

amphipoda
aranea
astigmata
...

Family

acrididae
adelgidae
aeolothripidae
...

Genus

Looks fine at first sight

Duplicated scientific names in taxon core

Several scientific names are duplicated in the taxon core:

| dwc_scientificName | idspecies | ordo | family | genus |
| --- | --- | --- | --- | --- |
| Cavia porcellus (Linnaeus 1758) | 52831 | Rodentia | Caviidae | Cavia |
| Cavia porcellus (Linnaeus 1758) | 900966 | NA | Caviidae | Cavia |
| Eliomys quercinus (Linnaeus 1766) | 52845 | Rodentia | Myoxidae | Eliomys |
| Eliomys quercinus (Linnaeus 1766) | 900975 | Rodentia | Gliridae | Eliomys |
| Erinaceus europaeus Linnaeus 1758 | 52847 | Insectivora | Erinaceidae | Erinaceus |
| Erinaceus europaeus Linnaeus 1758 | 900977 | Insectivora | Erinaceidae | Erinaceus |
| Hystrix cristata Linnaeus 1758 | 52858 | Rodentia | Istricidae | Hystrix |
| Hystrix cristata Linnaeus 1758 | 900984 | Cyperales | Istricidae | Hystrix |
| Mus musculus Linnaeus 1758 | 52877 | Rodentia | Muridae | Mus |
| Mus musculus Linnaeus 1758 | 900995 | Rodentia | Muridae | Mus |
| Mustela lutreola | 52878 | Carnivora | Mustelidae | Mustela |
| Mustela lutreola | 900997 | Carnivora | Mustelidae | Mustela |
| Pecari tajacu (Linnaeus 1758) | 52889 | Artiodactyla | Tayassuidae | Pecari |
| Pecari tajacu (Linnaeus 1758) | 901002 | NA | Tayassuidae | Pecari |

Some of these differ in their classification. I compared those species with the GBIF backbone, and suggest removing the records that do not conform to the backbone taxonomy. I suggest keeping the following records in the taxon core:

| dwc_scientificName | idspecies | ordo | family | genus |
| --- | --- | --- | --- | --- |
| Cavia porcellus (Linnaeus 1758) | 52831 | Rodentia | Caviidae | Cavia |
| Eliomys quercinus (Linnaeus 1766) | 900975 | Rodentia | Gliridae | Eliomys |
| Erinaceus europaeus Linnaeus 1758 | 52847 | Insectivora | Erinaceidae | Erinaceus |
| Hystrix cristata Linnaeus 1758 | 52858 | Rodentia | Istricidae | Hystrix |
| Mus musculus Linnaeus 1758 | 52877 | Rodentia | Muridae | Mus |
| Mustela lutreola | 52878 | Carnivora | Mustelidae | Mustela |
| Pecari tajacu (Linnaeus 1758) | 52889 | Artiodactyla | Tayassuidae | Pecari |
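
For reference, a duplicate check of this kind could be sketched in Python as below. The function name `duplicated_names` and the row layout (one dict per taxon, keyed by the column names used above) are assumptions for illustration; the last sample row is made up so that there is one name that is not duplicated.

```python
from collections import defaultdict


def duplicated_names(rows):
    """Map each scientific name that occurs more than once to its idspecies values."""
    ids_by_name = defaultdict(list)
    for row in rows:
        ids_by_name[row["dwc_scientificName"]].append(row["idspecies"])
    return {name: ids for name, ids in ids_by_name.items() if len(ids) > 1}


rows = [
    {"dwc_scientificName": "Cavia porcellus (Linnaeus 1758)", "idspecies": 52831},
    {"dwc_scientificName": "Cavia porcellus (Linnaeus 1758)", "idspecies": 900966},
    {"dwc_scientificName": "Mus musculus Linnaeus 1758", "idspecies": 52877},
    {"dwc_scientificName": "Mus musculus Linnaeus 1758", "idspecies": 900995},
    # Hypothetical record, present only to show that unique names are not reported.
    {"dwc_scientificName": "Hypothetical unique name", "idspecies": 1},
]

for name, ids in duplicated_names(rows).items():
    print(name, ids)
```

Deciding which of the duplicated records to keep still needs the manual comparison with the GBIF backbone described above; the sketch only finds the candidates.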

Clean eventDate data

When inspecting date information, the following dates are odd:

  1. Negative start_year for 174 records, starting from the year -7000 (!)
  2. Negative end_year for 124 records, starting from the year -6000
  3. For 29 distributions: the start year occurs after the end year (and not because I misinterpret the negative values :-) )

With respect to 1 and 2, I can hardly imagine these species being considered alien, as they were introduced so many years ago. I would suggest leaving eventDate empty for the records in 1, 2 and 3. This only affects about 200 distributions (out of a total of 56000 distributions, roughly 0.35%).
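
A minimal Python sketch of that rule, assuming both years are available as integers and that eventDate is written as a `start_year/end_year` interval. The function name `event_date` is hypothetical, and missing years are not handled here.

```python
def event_date(start_year, end_year):
    """Return 'start_year/end_year', or an empty string for the odd cases.

    Odd cases: a negative start year, a negative end year,
    or a start year that falls after the end year.
    """
    if start_year < 0 or end_year < 0:
        return ""
    if start_year > end_year:
        return ""
    return f"{start_year}/{end_year}"


print(event_date(1850, 1900))
print(repr(event_date(-7000, 1900)))
print(repr(event_date(1950, 1900)))
```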

compare taxonRank info with info from nameparser function

Information about the taxon rank can be provided in two ways:

  1. By using the information in subtaxon_rank in the taxon core. This relates to the original information in DAISIE. This information should thus only apply for subtaxa
  2. By using the information in rankmarker, provided by the GBIF nameparser

It would be interesting to see the differences between the returns of the GBIF nameparser and the content of the subtaxon_rank field

sources in taxon core

The taxon core contains 2 fields reserved for sources: bibliographicCitation and references (see here).

However, the only valid use of the term bibliographicCitation is to indicate the source of the taxon record (a field “source” is not available in the taxon core, in contrast with the distribution extension). references should be a URL to that record on a public website. However, some sources included in raw_taxon fall into neither category, e.g.:

Carpaneto G (1990) The Indian Grey Mangoose (Herpestes edwardsii) in the Circeo national Park:a case of incidental introduction. Mustelid and Viverrid Conservation, 2:10

Some are:

Davis PH et al (eds) (1965) Flora of Turkey and the East Aegean Islands, vol 1. University Press, Edinburgh

I'm tempted to exclude this information, but of course, then we lose information from the original dataset.
suggestions? @peterdesmet @DavidRoy

Incorrect Swedish vernacular name

Hi everyone,

I don't know if this is the place to report this type of issue but one of our users on GBIF noticed that the Swedish vernacular name for Centaurea jacea is incorrect.
He writes:

This is not "Blomsterlupin" in Swedish. Centaurea jacea is "rödklint" in Swedish. Blomserlupin is Lupinus polyphyllus.

See original GitHub issue: gbif/backbone-feedback#459
If this is not the place to report this type of issue, please let me know where I should log it. Thanks!!

Add kingdom information

Kingdom information is now excluded from the checklist.
When I find a spare moment, I will add the information in the Rmarkdown file.

Move distribution related information to occurrenceRemarks

Mentioned in #25: many of the descriptions are related to a specific distribution.

Rather than expressing this information in the description extension:

taxonid description type
100035 TDWG:CYP | IHO23_4:M3.1 - Established degree of establishment
100035 TDWG:CYP | IHO23_4:M3.1 - Cape Greco to Cape Andreas region of first record
100035 TDWG:CYP | IHO23_4:M3.1 - Dispersed vector
100035 TDWG:CYP | IHO23_4:M3.1 - Red Sea donor area
100035 TDWG:CYP | IHO23_4:M3.1 - Unknown impact on ecology
100035 TDWG:CYP | IHO23_4:M3.1 - Known impact on use
100035 TDWG:CYP | IHO23_4:M3.1 - Canals pathway

I think it will be more readable to express these in context with the distribution in occurrenceRemarks:

taxonid locationID locality ... occurrenceRemarks
100035 TDWG:CYP | IHO23_4:M3.1 Cyprus | Mediterranean Sea ... population_status: established | region_of_first_record: Cape Greco to Cape Andreas | current_distribution: NA | vector: unintentional | donor_region: red sea | impact_on_ecology: unknown | impact_on_use: known | pathway: canals
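
As a rough illustration, the concatenation into occurrenceRemarks could look like the Python sketch below. The function name `occurrence_remarks` is hypothetical; writing `NA` for missing values follows the example row above, and the key names are taken from that row.

```python
def occurrence_remarks(values):
    """Join 'key: value' pairs with ' | ', writing NA for missing values."""
    parts = []
    for key, value in values.items():
        if value is None or str(value).strip() == "":
            text = "NA"
        else:
            text = str(value).strip()
        parts.append(f"{key}: {text}")
    return " | ".join(parts)


remarks = occurrence_remarks({
    "population_status": "established",
    "region_of_first_record": "Cape Greco to Cape Andreas",
    "current_distribution": None,
    "pathway": "canals",
})
print(remarks)
```

Keeping the keys in a fixed order for every record makes the resulting strings easier to split again later.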

clean scientificNames and taxonRank

A thorough inspection is required for the scientific names and taxon ranks.
A quick scan using the GBIF nameparser reveals that almost 1800 scientific names could not be parsed, or could only be partially parsed, which gives an indication of nomenclatural issues. These include:

| type | records |
| --- | --- |
| CANDIDATUS | 1 |
| CULTIVAR | 16 |
| DOUBTFUL | 59 |
| HYBRID | 56 |
| INFORMAL | 40 |
| SCIENTIFIC | 1627 |

The nomenclatural issues of type SCIENTIFIC in particular should be inspected carefully.

I will publish the information now as it is integrated in the dataset.
Afterwards, we can take the time to inspect the data more in depth.

UTF-8 issue

I have the same problem here with the UTF-8 encoding of the taxon core.
The encoding of the raw data file is OK, but exporting it to a .csv file, or viewing it in R, produces encoding problems. I guess this is the same issue as this issue for the manual of alien plants. I cannot remember whether we ever solved that one, although encoding problems have not occurred for a long time.
I checked for the manual of alien plants, and the UTF-8 issue remains there as well, so I guess we didn't solve the problem yet.
@peterdesmet can you refresh my memory, please?

Extension taxa not in core

Important: the following taxonIDs in distribution do not have a corresponding taxon (probably my fault):

50244	1
54403	1
54563	1
90231	1
105323	50
105356	49

The same holds for vernacularname:

44	1
1201	3
5992	1
6978	16
8522	1
11921	4
17544	4
17906	16
19540	2
22885	8
23736	1
50244	1
50828	1
53392	1
53517	1
53523	1
53524	1
53530	1
53578	1

And description:

10569	1
22885	11
50244	6
50286	3
53282	1
53385	1
53419	1
53571	3
53578	1
53602	1
54331	1
54642	2
90231	1
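
A check like the one that produced these lists can be sketched in Python as follows. The function name `orphan_counts` and the sample IDs are hypothetical; the only assumption about the data is that the core and each extension expose a taxonID per row.

```python
from collections import Counter


def orphan_counts(core_ids, extension_ids):
    """Count extension rows per taxonID that has no matching taxon in the core."""
    core = set(core_ids)
    counts = Counter(taxon_id for taxon_id in extension_ids if taxon_id not in core)
    # Return a plain dict sorted by taxonID, matching the layout of the lists above.
    return dict(sorted(counts.items()))


# Hypothetical IDs for illustration only.
core_ids = [1, 2, 3]
distribution_ids = [1, 1, 2, 7, 7, 7, 5]

for taxon_id, n in orphan_counts(core_ids, distribution_ids).items():
    print(taxon_id, n, sep="\t")
```

Running the same function against distribution, vernacularname and description after each change to the core would show whether the orphaned records have been resolved.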

References in the distribution extension

The attribution of the sources to the records in the distribution extension differs from the approach used in the other dwc files. In the case of the distribution extension, a reference (in input_literature_references.csv) is attributed to a specific combination of field_name (column name in the raw data file) and id_sp_region (the id of a taxon in a particular region) in input_distribution:

| id_sp_region | field_name | reference |
| --- | --- | --- |
| 2365 | ecoimpact_id | aaa et al. (xxx) |
| 2365 | first_introduction | bbb et al. (yyyy) |

(example from input_literature_references)

The field_names as given in input_literature_references are:

  • current_distrib
  • current_distribution
  • distribution
  • ecoimpact_id
  • ecological impact
  • first observation
  • general references
  • history_is_known
  • impact on uses
  • introduction dates
  • introduction history
  • status
  • useimpact_id

The thing is, these field names do not correspond at all with the field names in input_distribution. On top of that, some field_names are very similar, such as current_distrib and current_distribution.

This is what I suggest:

  • Extract references for current_distrib, current_distribution, ecoimpact_id, ecological impact, first observation, impact on uses, status and useimpact_id, and link these references with the respective piece of information in the description extension (all these terms are mapped in the description).
  • Use distribution, general references, introduction dates and introduction history as references in the distribution extension, as they refer to fields included in this extension. This would be a concatenated string using | as a separator.
    @DavidRoy
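
The second part of this suggestion could be sketched in Python as below. The function name `distribution_sources`, the tuple layout of the rows and the sample references are assumptions; the field names are taken from the list of field_names above, and anything not in `DISTRIBUTION_FIELDS` (including history_is_known, which the suggestion does not assign) is simply skipped here.

```python
# field_name values whose references would go to the distribution extension.
DISTRIBUTION_FIELDS = {
    "distribution",
    "general references",
    "introduction dates",
    "introduction history",
}


def distribution_sources(rows):
    """Concatenate, per id_sp_region, the references of distribution-related fields.

    rows: iterable of (id_sp_region, field_name, reference) tuples.
    Repeated references for the same id_sp_region are kept only once.
    """
    refs_by_id = {}
    for id_sp_region, field_name, reference in rows:
        if field_name not in DISTRIBUTION_FIELDS:
            continue
        refs = refs_by_id.setdefault(id_sp_region, [])
        if reference not in refs:
            refs.append(reference)
    return {key: " | ".join(refs) for key, refs in refs_by_id.items()}


# Hypothetical rows for illustration only.
rows = [
    (2365, "distribution", "ref A"),
    (2365, "introduction dates", "ref B"),
    (2365, "ecoimpact_id", "aaa et al. (xxx)"),
    (2365, "general references", "ref A"),
    (2366, "general references", "ref C"),
]
print(distribution_sources(rows))
```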

References without species id

In input_references, about 8000 references have no link with an idspecies or id_sp_region. I find that a bit odd. These references will never be used in the other extensions, as each row there contains an idspecies field (which I use to join two tables together).
I prefer to throw these references out. In this way, the literature references will only include sources that link to a taxon. @DavidRoy?

Mapping of abundance

The DwC field occurrenceStatus will be, in certain cases, a combination between the fields abundance and population_status.

In most cases, we can translate abundance to the GBIF controlled vocabulary of occurrenceStatus. This is what I suggest (not all mappings are straightforward):

| abundance | occurrenceStatus |
| --- | --- |
| common | common |
| abundant | common |
| rare | rare |
| local | present |
| single record | present |
| sporadic | irregular |
| unknown | doubtful |

However, for 30 taxa, population_status contains the value Extinct, which is valuable information for occurrenceStatus:

| population_status | abundance | records |
| --- | --- | --- |
| Extinct | Absent or extinct | 9 |
| Extinct | Local | 6 |
| Extinct | Rare | 11 |
| Extinct | Single record | 4 |
| Extinct | Unknown | 8 |

I would suggest the following:

| population_status | abundance | occurrenceStatus |
| --- | --- | --- |
| Extinct | Absent or extinct | extinct |
| Extinct | Unknown | extinct |

For the remaining abundance data (21 records in total) we could split the data (this is the same approach as in the manual of alien plants):

| population_status | abundance | occurrenceStatus | eventDate |
| --- | --- | --- | --- |
| Extinct | local | present | start_year/end_year |
| Extinct | rare | rare | start_year/end_year |
| Extinct | single record | present | start_year/end_year |
| Extinct | local | extinct | end_year/now |
| Extinct | rare | extinct | end_year/now |
| Extinct | single record | extinct | end_year/now |

However, as it only concerns 21 records, I would keep it simple, and just map the abundance categories to extinct
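
The simple variant (Extinct always wins, otherwise translate abundance) could be sketched in Python like this. The function name `occurrence_status` is hypothetical, matching case-insensitively is an assumption about the raw values, and returning `None` for an abundance value without a suggested mapping is a placeholder rather than a decision.

```python
# Suggested translation of abundance to occurrenceStatus, as listed above.
ABUNDANCE_TO_STATUS = {
    "common": "common",
    "abundant": "common",
    "rare": "rare",
    "local": "present",
    "single record": "present",
    "sporadic": "irregular",
    "unknown": "doubtful",
}


def occurrence_status(abundance, population_status):
    """Return 'extinct' when population_status is Extinct, else map abundance."""
    if population_status.strip().lower() == "extinct":
        return "extinct"
    # None signals an abundance value that has no suggested mapping yet.
    return ABUNDANCE_TO_STATUS.get(abundance.strip().lower())


print(occurrence_status("Rare", "Extinct"))
print(occurrence_status("Single record", "Established"))
```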

establishmentMeans mapping

Information for establishmentMeans is contained in ss_species_status.
This is what I suggest for the mapping to the establishmentMeans vocabulary:

| ss_species_status | establishmentMeans |
| --- | --- |
| Alien | introduced |
| Cryptogenic | uncertain |
| Naturalized | naturalised |
| Casual | naturalised |
| Alien_invasive | invasive |

I chose to map casual to naturalised, based on the argument by @timadriaens here

How to publish DAISIE

The background information of DAISIE is published in the book "Handbook of Alien Species in Europe". In this book, each taxonomical group is published as a different chapter, with its own authorship:

  1. Alien fungi
  2. Alien Bryophytes and Lichens
  3. Alien vascular plants
  4. Alien Terrestrial invertebrates
  5. Alien invertebrates and fish
  6. Alien marine biota
  7. Alien birds, amphibians and reptiles
  8. Alien mammals

For the publication of DAISIE, I see two options:

  1. Publish a subset of DAISIE for each taxonomical group
    Advantage: it would be easier for researchers to scan for their specific species group
    Advantage: it would greatly reduce the list of authors (only max. of 6 authors)
    Advantage: each checklist would be published with its own metadata (from the book)
    Advantage: each checklist could be linked under the project information "DAISIE" (as we do in TrIAS)
    Disadvantage: This implies that 9 (!) DAISIE checklists will be published
    Disadvantage: much more work writing the metadata

  2. Publish the checklist as one large checklist on GBIF.
    Advantage: evident, only one DAISIE checklist on GBIF, which is our usual approach. Much easier to import the data and filter on the data as a whole
    Disadvantage: a very large group of authors.

I discussed this with @DavidRoy and for him it's not an option to only include the first author of each taxonomical group, and to include the rest in the metadata (if we would publish the dataset as a whole)

I prefer option 2, and just keep the checklist as a whole, which is what we always try to do. Of course I understand the advantages of splitting up the checklist.

@qgroom, @peterdesmet , @timadriaens @SoVDH ?
