trias-project / alien-plants-belgium

🌿 Manual of the Alien Plants of Belgium

Home Page: https://trias-project.github.io/alien-plants-belgium

License: MIT License

CSS 100.00%

dataset checklist r gbif rstats oscibio invasive-species

alien-plants-belgium's Introduction

Manual of the Alien Plants of Belgium

Rationale

This repository contains the functionality to standardize the Manual of the Alien Plants of Belgium to a Darwin Core checklist that can be harvested by GBIF. It was developed for the TrIAS project.

Workflow

source data → Darwin Core mapping script → generated Darwin Core files

Published dataset

Repo structure

The repository structure is based on Cookiecutter Data Science and the Checklist recipe. Files and directories indicated with GENERATED should not be edited manually.

├── README.md              : Description of this repository
├── LICENSE                : Repository license
├── alien-plants-belgium.Rproj : RStudio project file
├── .gitignore             : Files and directories to be ignored by git
│
├── data
│   ├── raw                : Source data, input for mapping script
│   └── processed          : Darwin Core output of mapping script GENERATED
│
├── docs                   : Repository website GENERATED
│
└── src
    ├── dwc_mapping.Rmd    : Darwin Core mapping script, core functionality of this repository
    ├── _site.yml          : Settings to build website in docs/
    └── index.Rmd          : Template for website homepage

Installation

Clone this repository to your computer
Open the RStudio project file
Open the dwc_mapping.Rmd R Markdown file in RStudio
Install any required packages
Click Run > Run All to generate the processed data
Alternatively, click Build > Build website to generate the processed data and build the website in docs/

Contributors

List of contributors

License

MIT License

alien-plants-belgium's Issues

Duplicate description

In his ongoing quest for order in the universe 🕵️‍♂️ @damianooldoni has discovered a duplicate description in this dataset:

http://api.gbif.org/v1/species/141265904/descriptions has cbd_2014_pathway:contaminant_seed twice.

The taxon in question is Panicum dichotomiflorum Michaux with taxonID alien-plants-belgium:taxon:1667966a270681c6dbab1474b0fa0c9b and the duplicate description is probably due to the values Grain, seeds, wool mapping to cbd_2014_pathway:contaminant_seed twice. It can easily be solved by doing a check for duplicates on the extension and removing those. Such a check would actually be good for all extensions for all datasets.
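The duplicate check suggested above could look roughly like the following. This is a minimal Python sketch for illustration only (the repository's mapping script is written in R); the row layout and the "pathway" type label are assumptions, not the actual extension schema.

```python
def drop_duplicate_rows(rows):
    """Return rows with exact duplicates removed, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical description-extension rows: (taxonID, description, type).
TAXON_ID = "alien-plants-belgium:taxon:1667966a270681c6dbab1474b0fa0c9b"
description_rows = [
    (TAXON_ID, "cbd_2014_pathway:contaminant_seed", "pathway"),
    (TAXON_ID, "cbd_2014_pathway:contaminant_seed", "pathway"),  # duplicate
]

deduplicated = drop_duplicate_rows(description_rows)
```

Because the check works on whole rows, the same function could be applied to any extension of any dataset.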

Region distribution not always conclusive

Even though we won't use the X and ? information in the presence columns, I wanted to highlight that there are:

22 records where all 3 regions have ?
2 records where all 3 regions are blank

There is not always a relation with how the D/N column is populated. Maybe something to look into.

Update hybrid taxonRank

In response to gbif/portal-feedback#1354 (comment), change taxonRank from hybrid to the lowest rank for 135 taxa. Note: not all are species, e.g. Cornus sanguinea subsp. hungarica (Kárpáti) Soó should get rank subspecies.

That should solve rank unknown issues.

How to express year first seen

@qgroom @SoVDH @timadriaens @DimEvil

The manual of alien plants has a column FR (first record?), which lists values like:

?
<1824
1976
ca. 1975

This information only makes sense when it is related to a region. However, this information is not available for Fl, Br and Wa separately, but generally (so assuming "Belgium"). The best way to then express that information is to add distribution records for "Belgium" in addition to the more detailed Fl, Br and Wa distribution regions:

Flanders: present
Wallonia: unknown
Belgium: present, firstRecord: 1996

The dataset also contains a column MRR (definition??), with years that are always later than FR. Is this "last seen"? Not sure what to do with this. It also contains values like:

Ann.
N
Nat.?

And finally, we'll need to think about how best to express the firstRecordYear information. Most values are a YYYY year, but there are some odd values and question marks too, which we might need to standardize. It basically boils down to how we want to use that information later on.

Ignore synonym information

We decided not to include synonym information in the published dataset.

Ignore mode of introduction (M/I)

Does not seem to contain more useful information (Deliberate / Accidental) than V/I, so we'll ignore this in the mapping.

@qgroom is that correct?

Resource is registered with BBPF instead of BGM

The resource is currently registered with the BBPF. @DimEvil @andrejjh, can you register this with the Botanic Garden Meise instead?

Some IDs in extension not found in core

According to the DwC validator, there are IDs in the distribution (19) and description (43) extension that do not appear in the core: https://www.gbif.org/tools/data-validator/1509439657182. Look for the error Record referential integrity violation.

One such ID (in both extensions, not in core) is 2482.
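A referential integrity check of this kind can be reproduced locally before publishing. The sketch below is a Python illustration, not the validator's own logic; the function name and the sample IDs (apart from 2482) are made up.

```python
def orphan_ids(core_ids, extension_ids):
    """Return the sorted extension IDs that have no matching record in the core."""
    return sorted(set(extension_ids) - set(core_ids))


# Hypothetical ID lists; only "2482" comes from the issue above.
core = ["1", "2", "3"]
distribution = ["1", "2", "2482"]

print(orphan_ids(core, distribution))  # prints ['2482']
```

Running it once per extension file would list every ID that triggers the violation.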

Use .Rmd file instead of .R

Create dropdown for V/I

V/I = mode of introduction = pathway = establishmentMeans currently contains good values, so we won't replace this with the Blackburn et al. controlled vocabulary (as we'll also lose some granularity). The current values are not always formatted consistently however.

Clean list for formatting (e.g. remove ,...)
Provide as a dropdown
Multiple values are still possible

Decide on actual values to use for establishmentMeans

We have four options:

Take values from first checklist vocab by @SoVDH and @timadriaens: contaminant > transportation of habitat material (soil, vegetation,…)
Shorten those values (current implementation): contaminant:habitat_material
Use proposed pathway vocab by @qgroom: transportation of habitat material (no category info in values)
Shorten those values...

Use character encoding

Some scientific names contain <U+00E9>:

Daucus glochidiatus (Labill.) Fisch., C.A. Mey. et Av<U+00E9>-Lall.

Which is probably due to character encoding.
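If the escaped code points have already ended up in a file, they can be converted back to real characters. The following Python sketch is an illustration of that repair only; the proper fix is still to read and write the source with the right encoding, and the function name is hypothetical.

```python
import re

# Matches escapes such as <U+00E9> (4 to 6 hexadecimal digits).
ESCAPE = re.compile(r"<U\+([0-9A-Fa-f]{4,6})>")


def unescape_codepoints(text):
    """Replace every <U+XXXX> escape with the character it stands for."""
    return ESCAPE.sub(lambda match: chr(int(match.group(1), 16)), text)


name = "Daucus glochidiatus (Labill.) Fisch., C.A. Mey. et Av<U+00E9>-Lall."
print(unescape_codepoints(name))
# prints: Daucus glochidiatus (Labill.) Fisch., C.A. Mey. et Avé-Lall.
```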

establishmentMeans

establishmentMeans will be based on V/I and:

Mapped to Blackburn et al.
Multiple values will be provided in the single establishmentMeans term, but separated by pipe.
This information will only be added for distribution records for Belgium (not Flanders, etc.)
Remove field establishmentMeans from spreadsheet
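Combining multiple values into one pipe-separated term could be sketched as follows. This is a Python illustration under stated assumptions: the lookup table is invented for the example (it is not the Blackburn et al. mapping), and the " | " spacing around the pipe is a guess.

```python
def to_pipe_separated(raw_value, mapping):
    """Map comma-separated source values and join the unique results with ' | '."""
    mapped = []
    for part in raw_value.split(","):
        part = part.strip()
        if not part:
            continue  # skip empty fragments such as a trailing comma
        value = mapping.get(part.lower(), part)  # unmapped values pass through
        if value not in mapped:
            mapped.append(value)
    return " | ".join(mapped)


# Hypothetical lookup table, for illustration only.
EXAMPLE_MAPPING = {
    "grain": "contaminant_seed",
    "seeds": "contaminant_seed",
    "wool": "contaminant_on_animals",
}

print(to_pipe_separated("Grain, seeds, wool", EXAMPLE_MAPPING))
# prints: contaminant_seed | contaminant_on_animals
```

Dropping repeated values while joining also avoids the same term appearing twice for one taxon.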

Importing and exporting to UTF-8

There seems to be a problem with the data encoding when importing the taxon dataset from, and exporting it to, a .csv file.

With respect to the importing:

A warning message emerges:

Warning message: In Sys.setlocale("LC_ALL", "en_US.UTF-8") : OS reports request to set locale to "en_US.UTF-8" cannot be honored

This warning message only occurs on Windows and can be resolved by applying

Sys.setlocale("LC_ALL", "English_United States.1252")

see this table of locales and the following GitHub issue

I already adapted this in the script

With respect to the exporting:

The taxon data is not exported as UTF-8, despite the fileEncoding = "UTF-8" argument in the write.csv statement

E.g. in line 188 of the taxon dataset, the special character č in the scientificName Pastinaca sativa L. subsp. urens (Req. ex Godr.) čelak is converted to c.

I suspect this problem is due to a printing bug in R, even though the data is read and stored correctly. Apparently, the print() method on data.frame tries to round-trip characters through the active encoding, which is lossy when converting UTF-8 encoded characters (see this GitHub issue). However, this is just an idea; I have tried several other options, without result.

@qgroom do you have any idea what the problem is here?

Does bold text mean anything in the checklist

Don't use parsed scientific name columns or classification

GBIF will parse the scientificName and link to the backbone taxonomy. We don't want to maintain our own parsed scientific name columns or classifications, so those newly added columns should be removed:

Don't use hybrid as taxonRank

GBIF will indicate those with rank unknown. That makes sense, you could argue that Berberis x hybrido-gagnepainii J.V. Suringar should have the rank species. I'd therefore suggest to use the correct taxonRank in the Excel file.

We already have a column hybridFormula which should also allow us to find the 125 hybrids in the data. For 15 of the 125 it's not populated yet, but once we have that you can filter on hybrids again.

GBIF also automatically detects hybridFormula names: https://www.gbif.org/species/search?dataset_key=9ff7d317-609b-4c08-bd86-3bc404b77c42&name_type=HYBRID&issue=RANK_INVALID&advanced=1

To be coordinated with Filip

the use of extinct as a degreeOfEstablishment/invasionStage is confusing

We have based degreeOfEstablishment/invasionStage on Blackburn et al 2011, which does not have a term for extinct and I don't think it should. We have placed the term extinct under occurrenceStatus as a subcategory of absent.
Currently, the term extinct is being used in the description extension in the Manual under the invasionStage as one element that links to the taxon. This is problematic because this one entry can cover two entries in the distribution extension: one for the period that the taxon was present and one for when it was absent (extinct).

Add degree of establishment

Add degree of establishment to the description extension, with the values:

established
introduced
data deficient (which includes to be determined by experts)

I don't remember on which field in the Excel this was based, but it should be in the Google Doc by Tim. I also don't know the name of the used vocabulary (if any).

Vocab for native range

There is a vocab for native range suggested in the Excel file. I have some questions and suggestions about it and would like to have it confirmed:

original value	original suggested value	new suggested value	comment
AM	pan-American	Americas	For the region, Wikipedia suggests "Americas"
NAM	Northern America	Northern America	Do we mean Northern America (US & CA) or North America (includes Mexico and Latin America)??
SAM	Southern America	South America	Is redirected on Wikipedia to South America, so would use that name.
AF	Africa	Africa
E	Europe	Europe
AUS	Australasia	Australasia	Has Wikipedia page
AS	Asia	Asia
AS-Tr	tropical Asia	Tropical Asia	Has Wikipedia page and is written with capital letter in sentences.
AS-Te	temperate Asia	Temperate Asia	Does not have Wikipedia page, just a category, where it is mentioned as a biogeographical region, with capital letters.
Trop.	Pantropical	pantropical	Is written lowercase on Wikipedia
Cult.	cultivated origin	cultivated	I don't think there is a need to indicate "origin" in `native range = cultivated`
Hybr.	Hybrid origin	hybridization	I think this works better as `native range = hybridization`
?	unknown		For the moment, any ? value is left blank and not included in the description extension
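Once confirmed, the vocabulary amounts to a simple lookup. The Python sketch below uses the "new suggested value" column from the table; the function name is hypothetical, and returning None for unrecognised codes is an assumption about how such values should be handled.

```python
# Source code -> new suggested value, copied from the table above.
NATIVE_RANGE = {
    "AM": "Americas",
    "NAM": "Northern America",
    "SAM": "South America",
    "AF": "Africa",
    "E": "Europe",
    "AUS": "Australasia",
    "AS": "Asia",
    "AS-Tr": "Tropical Asia",
    "AS-Te": "Temperate Asia",
    "Trop.": "pantropical",
    "Cult.": "cultivated",
    "Hybr.": "hybridization",
}


def map_native_range(code):
    """Return the suggested value, or None for '?' and unrecognised codes."""
    return NATIVE_RANGE.get(code.strip())


print(map_native_range("AS-Te"))  # prints Temperate Asia
print(map_native_range("?"))      # prints None (left blank, no description record)
```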

D/N information should apply to year range

As discussed with @Groom, the degree of naturalization (D/N) should apply to the year range defined by FR (first record) and MRR (most recent record).

Currently it is confusing if a casual taxon with an MRR of e.g. 1945 should be labelled extinct or casual. The best way is to indicate what the status was of that taxon for that time period:

1944    1957    casual

That also means that none of the taxa should be labelled as extinct.

This information should be updated.

ID's in the manual

Reported by @DimEvil at trias-project/alien-species-checklist#87:

Are we sure we would like ID's like 1,2,3.....
I would propose to make the
ID's more unique....

Should we add info for Flanders, Brussels, Wallonia in distribution

Information regarding pathway, first record and most recent record all apply to the distribution in Belgium. Extrapolating that information to Flanders, Wallonia and Brussels is somewhat odd. The checklist does contain minimal presence information regarding those regions ( X or ?).

So the question is, do we publish distribution information about the regions:

locality	eventDate	occurrenceStatus	establishmentMeans
Belgium	1945-2017	casual	seed_contaminant
Flanders	implied?	implied?	implied?

Or not:

locality	eventDate	occurrenceStatus	establishmentMeans
Belgium	1945-2017	casual	seed_contaminant

Decide on name for "origin" as description

The description extension currently uses the type name "origin" for vagrant, introduced, etc. We need to decide if we want to keep it that way.

The name "origin" is discussed here: qgroom/ias-dwc-proposal#5

Add column hybridFormula

The column Taxon sometimes contains the hybrid formula in parentheses.

Taxon: Chenopodium x preissmannii J. Murr (C. album L. x opulifolium)

This will make it difficult for the GBIF name parsing, so it's better to move this information to a separate column:

Taxon: Chenopodium x preissmannii J. Murr
HybridFormula: C. album L. x opulifolium

For hybrids without a name, both columns should be populated with the same info:

Taxon: Chenopodium album L. x hircinum
HybridFormula: Chenopodium album L. x hircinum

Ideally, the HybridFormula should also be more complete and consistent, like Chenopodium album L. x Chenopodium opulifolium L.

Note: the hybridFormula information will not be mapped to published data.
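The split described above could be prototyped as follows. This Python sketch is a heuristic illustration rather than the method used for the spreadsheet: it assumes the formula is always the last parenthesised group containing " x ", and that an unnamed hybrid has its "x" after the second word.

```python
import re

# A trailing "(... x ...)" group; author parentheses such as "(L.)" have no " x ".
TRAILING_FORMULA = re.compile(
    r"^(?P<taxon>.*\S)\s*\((?P<formula>[^()]*\sx\s[^()]*)\)\s*$"
)


def split_hybrid_formula(taxon):
    """Return (taxon, hybrid_formula); hybrid_formula is None for non-hybrids."""
    match = TRAILING_FORMULA.match(taxon)
    if match:
        return match.group("taxon"), match.group("formula").strip()
    # Unnamed hybrid: the whole name is the formula ("Genus epithet Author x epithet").
    if "x" in taxon.split()[2:]:
        return taxon, taxon
    return taxon, None


print(split_hybrid_formula("Chenopodium x preissmannii J. Murr (C. album L. x opulifolium)"))
# prints ('Chenopodium x preissmannii J. Murr', 'C. album L. x opulifolium')
```

Names the heuristic gets wrong would still need a manual check in the spreadsheet.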

Add taxonRank dropdown

taxonRank is a required term. The suggested vocabulary is this one, but it does not contain hybrids. We need this information for preparsing some of the names (in case the GBIF parsing fails).

Suggestion for terms:

genus
species
subspecies
variety
cultivar
hybrid             own term
hybridFormula      own term

Don't require meta.xml

Manually creating a meta.xml is not necessary. Figure out with @qgroom how to do that.

Decide on name for "native range" in description

The description extension currently uses the type name "native range" for Americas, Australasia, etc. We need to decide if we want to keep it that way.

The name "native range" came up in this discussion: qgroom/ias-dwc-proposal#5 and is what was suggested as a term here

Remove parentNameUsageID

Currently contains link to a name (to IPNI), not a nameUsage (= taxon).

Full names for hybrid formulas

Chenopodium album L. x hircinum can be interpreted as:

A named subspecies hybrid of Chenopodium album
A hybrid formula: Chenopodium album L. x Chenopodium hircinum L.

If the latter, I would spell out the name in full. It's best to do this for all hybrid formulas.

Degree of establishment

Won't try to map this information for now, remove column from spreadsheet.

@qgroom correct?

fields FL, Br, WA

Hi there, as you know the idea is also to use the checklists to feed regional biodiversity indicators. Therefore, these fields are important to us to include. For a number of the other checklists, we can infer them: the non-native fish are for Flanders only, the macroinvertebrates as well so these are all Fl: present. Inferring this from the occurrences will not work, as many species figure on checklists based on published information rather than occurrences. Is it still possible to include this @peterdesmet ?

occurrenceStatus

Should be extant (= present) for all taxa, as extinct will be removed (see #9). As this is a very simple mapping, the column occurrenceStatus should be removed from the spreadsheet.

Species with doubtful degree of establishment should not have doubtful occurrenceStatus

Copied from @qgroom trias-project/indicators#21 (comment)

Some of your doubtful species are certainly present, e.g. Pinus sylvestris, Narcissus pseudonarcissus, so perhaps it is doubtful that they are naturalized. For P. sylvestris the checklist says it is doubtful only for Brussels. He must mean it is doubtfully naturalized, because I'm 100% sure it is present.

Picea abies, Pinus rigida and Pinus pinaster are probably similar cases. They are planted, but it is doubtful if they are escaping.

For N. pseudonarcissus the checklist is only referring to Narcissus pseudonarcissus L. subsp. major, not N. pseudonarcissus in general. It is an important distinction as N. pseudonarcissus is a native species, but not this subspecies.

The rest are probably genuinely doubtful species. That is they are doubtfully present.

@LienReyserhove can you ask Filip to change this from ? to X in the source data?

Using new establishmentMeans causes "distribution:invalid" warning on GBIF

The checklist taxa get a distribution:invalid tag on GBIF (e.g. https://www.gbif.org/species/134086869). We have to figure out why.

Empty invasion stage for Galatella sedifolia (L.) Greuter

Noted one record with an empty invasion stage:

"alien-plants-belgium:taxon:00bb2bc05d9481b0bc8b5b57117fa799","","invasion stage","en"

Is Galatella sedifolia (L.) Greuter with empty D/N. Is the only record with empty D/N. Would not create record for empty D/N.

Set establishmentMeans to "introduced"

As long as TDWG hasn't adopted (and GBIF hasn't implemented) the use of a pathway vocabulary for establishmentMeans, I propose that we stick to the current one that is in use:

http://rs.gbif.org/vocabulary/gbif/establishment_means.xml

Which is:

native
introduced
- naturalised
- invasive
- managed
uncertain

I would opt to use introduced (which has synonyms exotic and alien) for all distributions (Belgium + regions) even for presence uncertain, because we know all those species are alien. I wouldn't go further in stating if something is naturalized or invasive.

That should solve the distribution error we currently get #35
Make the data more useful for folks looking for introduced species
Our pathway information is still published in the description extension, unconcatenated (which is even better)

Update headers in source file

Remove empty line 1
Update Presence to Presence_Fl, Presence_Br, Presence_Wa
Remove empty lines 3 and 4
Add column ID as first column

Stable IDs in the manual

Taxon IDs are currently stored in the Excel file. However, when new species are added (alphabetically), the ID column is reassigned by Filip, so all IDs run again from 1 to x. That means that almost all species get assigned new IDs.

Option 1: Explain to Filip that those IDs should be kept stable, but that requires manual addition of new IDs... which will eventually lead to duplicates or gaps.

Option 2: Generate an ID (hash) in the R script, based on the scientific name. If a name gets updated, it gets a new ID (which is logical, except for typographic corrections, but we can live with that). Has the advantage that records won't change in git every time the ID column gets recalculated.

I prefer option 2.

Have asked GBIF if taxonID is even used to assign taxonKeys in GBIF: gbif/portal-feedback#688
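Option 2 could be sketched like this. It is a Python illustration of the idea, not the R implementation; MD5 is an assumption (the 32-character IDs seen elsewhere in these issues are consistent with it, but the issue does not name the hash), and the function name is hypothetical.

```python
import hashlib


def taxon_id(scientific_name, prefix="alien-plants-belgium:taxon:"):
    """Derive a stable ID from the scientific name (MD5 is an assumed hash choice)."""
    digest = hashlib.md5(scientific_name.encode("utf-8")).hexdigest()
    return prefix + digest


# The same name always yields the same ID, wherever the row sits in the spreadsheet.
first = taxon_id("Panicum dichotomiflorum Michaux")
second = taxon_id("Panicum dichotomiflorum Michaux")
print(first == second)  # prints True
```

Any change to the name, including a typographic correction, produces a different ID, which matches the trade-off accepted above.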

origin

Valid values for this dataset are:

vagrant for D/N = Cas.
introduced for D/N = Nat. or Inv.

As this is a simple mapping, the column origin should be removed from the spreadsheet

Map description extension

Currently not published.

Difference in number of taxa: source <-> GBIF

The source dataset has 2,500 records (= taxa), while on GBIF it is displayed as 2,653. We have to figure out why.

Add invasion stage

Suggestion: add invasion stage to the description extension. The mapping would look like this (occurrenceStatus and establishmentMeans in distribution extension):

raw data (D/N)	invasion stage	occurrenceStatus	establishmentMeans
Cas.	Casual	present	introduced
Cas.?	Casual	present	introduced
Ext / Cas.	Casual	present	introduced
Ext		absent	introduced
Ext?		absent	introduced
Inv.	established	present	introduced
Nat?	established	present	introduced
Nat.	established	present	introduced
Nat.?	established	present	introduced
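The table above translates directly into a lookup. The Python sketch below is an illustration (the mapping script itself is R); the values are copied verbatim from the table, and raising KeyError for a D/N value that is not listed is an assumption about the desired behaviour.

```python
# D/N value -> (invasion stage, occurrenceStatus, establishmentMeans), as in the table.
DN_MAPPING = {
    "Cas.": ("Casual", "present", "introduced"),
    "Cas.?": ("Casual", "present", "introduced"),
    "Ext / Cas.": ("Casual", "present", "introduced"),
    "Ext": ("", "absent", "introduced"),
    "Ext?": ("", "absent", "introduced"),
    "Inv.": ("established", "present", "introduced"),
    "Nat?": ("established", "present", "introduced"),
    "Nat.": ("established", "present", "introduced"),
    "Nat.?": ("established", "present", "introduced"),
}


def map_degree_of_naturalisation(dn):
    """Look up the three mapped terms; raises KeyError for values not in the table."""
    stage, status, means = DN_MAPPING[dn.strip()]
    return {
        "invasion stage": stage or None,  # empty for extinct taxa
        "occurrenceStatus": status,
        "establishmentMeans": means,
    }


print(map_degree_of_naturalisation("Ext?"))
# prints {'invasion stage': None, 'occurrenceStatus': 'absent', 'establishmentMeans': 'introduced'}
```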

Background:

In the latest TrIAS core group meeting, we decided to integrate invasion stage in the description extension of the Manual of Alien Plants. As for the Registry of Alien Macroinvertebrates in Flanders, Belgium and the Checklist of Non-native Freshwater Fishes in Flanders, Belgium, we use the invasion stage vocabulary listed in Blackburn et al. 2011.

To integrate invasion stage in the description extension, we need to make some decisions:

Discard origin information:

The information needed for invasion stage is contained in D/N (degree of naturalisation) in the raw data file (see table for the terms). This information can perfectly be mapped to the Blackburn et al. (2011) vocabulary. This field is currently used for the mapping of origin in the description extension (see #12). I would discard the term origin and replace it by invasion stage

Interpretation of D/N values:

For the previous checklists, we used the invasion stages casual and established. We decided to discard the terms naturalized or invasive listed in Blackburn et al. (see trias-project/alien-fishes-checklist#6 (comment)). So, naturalized and invasive in D/N are replaced by established. casual in D/N, of course, remains casual.
D/N = Extinct has no alternative in Blackburn et al. (2011). For this, I would leave invasion stage empty and supplement the information with information in occurrenceStatus and establishmentMeans (see table on top). See occurrenceStatus GBIF vocabulary: "An extinct organism is absent while its establishmentMeans is native". Of course, as we are working with invasive species, establishmentMeans would be introduced in this case.
We "ignore" the question marks in 'D/N' (see #12 (comment))

Extend metadata

Decide how to handle <yyyy and yyyy? first record dates

Proposal:

<1861 => 1500/1860
? => 1500
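The proposal could be expressed as a small conversion function. This Python sketch covers only the two proposed rules plus plain years; values such as "ca. 1975" or "yyyy?" return None because the issue does not yet decide how to treat them. The function name is hypothetical.

```python
import re

EARLIEST_YEAR = 1500  # lower bound taken from the proposal above


def first_record_to_date(value):
    """Convert a first record value to a year or year range, or None if undecided."""
    value = value.strip()
    if value == "?":
        return str(EARLIEST_YEAR)
    before = re.fullmatch(r"<\s*(\d{4})", value)
    if before:
        # "<1861" means any year up to and including 1860.
        return f"{EARLIEST_YEAR}/{int(before.group(1)) - 1}"
    if re.fullmatch(r"\d{4}", value):
        return value
    return None


print(first_record_to_date("<1861"))  # prints 1500/1860
print(first_record_to_date("?"))      # prints 1500
```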

How to express invasionStage for plants?

@qgroom @SoVDH @timadriaens what invasion stage information do we want to extract from the manual of alien plants?

This is the information we have:

D/N is information you have already mapped (filter on "plants")
presence_value comes from the presence columns Fl, Wa, and Br, which can have x for present and ? for unknown (I assume).

In its simplest form, we only express that a taxon is:

present = x for that region
unknown = ? for that region

But we could also take some of the information from the D/N column, to make more nuanced statements, such as established, introduced, to be determined, unknown. Note that even for values in D/N without a question mark, we do have question marks in the presence_value column (and vice versa), so we need to decide which one takes preference.
And finally, a lot of taxa have empty presences for regions. We could either express this as absent from that region, or take the more cautious approach (maybe there was no monitoring for that region) and not say anything about the occurrence of that species in that region.

Could you give some input regarding this, so we know how to basically map the above table to occurrenceStatus (that's the field to use right @qgroom)?

Add genus, specificEpithet and infraspecificEpithet

Could be generated from the scientificName with the GBIF name parser. We could then choose one of the following approaches:

Add as 3 columns to source file

Bulk of it could be populated once with GBIF name parser and then maintained manually.

Main advantage: speed + control

Add as 3 columns in mapping

Would have to be parsed every time (slowing down the publication)

Main advantage: no maintenance

Add speciesProfile information

@qgroom can you ask Filip to add a column to the source file called environment, with the following 3 values:

M for isMarine
F for isFreshwater
T for isTerrestrial

The column can contain multiple values, such as F/T, separated by /.

We will use this information to populate the http://rs.gbif.org/extension/gbif/1.0/speciesprofile.xml

Note: we will also indicate FALSE if something is not indicated, so T will have:

isMarine: false
isFreshwater: false
isTerrestrial: true
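Deriving the three boolean terms from the proposed environment column could look like this. It is a Python sketch for illustration; rejecting unknown codes with an error is an assumption, as is the function name.

```python
# Environment code -> speciesProfile term, as listed above.
FLAGS = {"M": "isMarine", "F": "isFreshwater", "T": "isTerrestrial"}


def species_profile(environment):
    """Turn a value such as 'F/T' into the three boolean speciesProfile terms."""
    codes = {code.strip().upper() for code in environment.split("/") if code.strip()}
    unknown = codes - FLAGS.keys()
    if unknown:
        raise ValueError(f"Unexpected environment code(s): {sorted(unknown)}")
    # Every term is always present; codes that are not listed become False.
    return {term: code in codes for code, term in FLAGS.items()}


print(species_profile("T"))
# prints {'isMarine': False, 'isFreshwater': False, 'isTerrestrial': True}
```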
