
Manual of the Alien Plants of Belgium

Rationale

This repository contains the functionality to standardize the Manual of the Alien Plants of Belgium to a Darwin Core checklist that can be harvested by GBIF. It was developed for the TrIAS project.

Workflow

source data → Darwin Core mapping script → generated Darwin Core files

Published dataset

Repo structure

The repository structure is based on Cookiecutter Data Science and the Checklist recipe. Files and directories indicated with GENERATED should not be edited manually.

├── README.md              : Description of this repository
├── LICENSE                : Repository license
├── alien-plants-belgium.Rproj : RStudio project file
├── .gitignore             : Files and directories to be ignored by git
│
├── data
│   ├── raw                : Source data, input for mapping script
│   └── processed          : Darwin Core output of mapping script GENERATED
│
├── docs                   : Repository website GENERATED
│
└── src
    ├── dwc_mapping.Rmd    : Darwin Core mapping script, core functionality of this repository
    ├── _site.yml          : Settings to build website in docs/
    └── index.Rmd          : Template for website homepage

Installation

  1. Clone this repository to your computer
  2. Open the RStudio project file
  3. Open the dwc_mapping.Rmd R Markdown file in RStudio
  4. Install any required packages
  5. Click Run > Run All to generate the processed data
  6. Alternatively, click Build > Build website to generate the processed data and build the website in docs/

Contributors

List of contributors

License

MIT License

alien-plants-belgium's People

Contributors

lienreyserhove, peterdesmet, qgroom

alien-plants-belgium's Issues

Duplicate description

In his ongoing quest for order in the universe 🕵️‍♂️ @damianooldoni has discovered a duplicate description in this dataset:

http://api.gbif.org/v1/species/141265904/descriptions has cbd_2014_pathway:contaminant_seed twice.

The taxon in question is Panicum dichotomiflorum Michaux with taxonID alien-plants-belgium:taxon:1667966a270681c6dbab1474b0fa0c9b and the duplicate description is probably due to the values Grain, seeds, wool mapping to cbd_2014_pathway:contaminant_seed twice. It can easily be solved by doing a check for duplicates on the extension and removing those. Such a check would actually be good for all extensions for all datasets.
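The duplicate check suggested above could look roughly like the following Python sketch (the repository's own mapping script is R, so this is an illustration only). The function name `drop_duplicate_rows`, the column order and the second sample row are assumptions made for the example.

```python
def drop_duplicate_rows(rows):
    """Return rows with exact duplicates removed, keeping the first occurrence."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Illustrative description-extension rows: (taxonID, description, type, language).
# The first two rows mimic the reported duplicate; the third is a made-up placeholder.
rows = [
    ("alien-plants-belgium:taxon:1667966a270681c6dbab1474b0fa0c9b",
     "cbd_2014_pathway:contaminant_seed", "pathway", "en"),
    ("alien-plants-belgium:taxon:1667966a270681c6dbab1474b0fa0c9b",
     "cbd_2014_pathway:contaminant_seed", "pathway", "en"),
    ("example:taxon:placeholder", "example description", "pathway", "en"),
]

deduplicated = drop_duplicate_rows(rows)
print(len(rows), "rows before,", len(deduplicated), "rows after")
```

Because the check compares whole rows, it can be applied unchanged to any extension file.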

Region distribution not always conclusive

Even though we won't use the X and ? information in the presence columns, I wanted to highlight that there are:

  • 22 records where all 3 regions have ?
  • 2 records where all 3 regions are blank

There is not always a relation with how the D/N column is populated. Maybe something to look into.
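A count like the one above can be reproduced with a short Python sketch. It assumes each record is reduced to a (Fl, Br, Wa) triple of presence values; the function name `summarize_presence` and the sample triples are illustrative, not taken from the manual.

```python
def summarize_presence(records):
    """Count records whose three regional presence values are all '?' or all blank."""
    all_unknown = 0
    all_blank = 0
    for record in records:
        values = [value.strip() for value in record]
        if all(value == "?" for value in values):
            all_unknown += 1
        elif all(value == "" for value in values):
            all_blank += 1
    return {"all_unknown": all_unknown, "all_blank": all_blank}


# Illustrative (Fl, Br, Wa) triples, not actual rows from the manual
sample = [("X", "", "?"), ("?", "?", "?"), ("", "", ""), ("X", "X", "X")]
print(summarize_presence(sample))
```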

How to express year first seen

@qgroom @SoVDH @timadriaens @DimEvil

The manual of alien plants has a column FR (first record?), which lists values like:

?
<1824
1976
ca. 1975
  1. This information only makes sense when it is related to a region. However, it is not available for Fl, Br and Wa separately, only in general (so assuming "Belgium"). The best way to express it is then to add distribution records for "Belgium" in addition to the more detailed Fl, Br and Wa distribution regions:
Flanders: present
Wallonia: unknown
Belgium: present, firstRecord: 1996
  2. The dataset also contains a column MRR (definition??), with years that are always later than FR. Is this "last seen"? Not sure what to do with this. It also contains values like:
Ann.
N
Nat.?
  3. And finally, we'll need to think about how best to express the firstRecordYear information. Most values are a YYYY year, but there are some odd values and question marks too, which we might need to standardize. It basically boils down to how we want to use that information later on.
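One possible way to standardize the FR values listed above is sketched below in Python. The function name `parse_first_record` and the qualifier labels ("exact", "before", "circa") are assumptions for illustration; anything that does not look like a year, such as ?, is returned as unparsed rather than guessed.

```python
import re


def parse_first_record(value):
    """Split an FR value into (year, qualifier); return (None, None) if unparseable."""
    match = re.fullmatch(r"(<|ca\.)?\s*(\d{4})", value.strip())
    if match is None:
        return None, None  # e.g. "?" or other odd values
    prefix, year = match.groups()
    qualifier = {None: "exact", "<": "before", "ca.": "circa"}[prefix]
    return int(year), qualifier


for raw in ["?", "<1824", "1976", "ca. 1975"]:
    print(repr(raw), "->", parse_first_record(raw))
```

Keeping the qualifier separate from the year leaves the decision on how to publish uncertain years open.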

Create dropdown for V/I

V/I = mode of introduction = pathway = establishmentMeans currently contains good values, so we won't replace this with the Blackburn et al. controlled vocabulary (as we'll also lose some granularity). The current values are not always formatted consistently however.

  • Clean list for formatting (e.g. remove ,...)
  • Provide as a dropdown
  • Multiple values are still possible
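The cleaning step could be sketched as follows in Python, assuming multiple values are separated by commas and that trailing ",..." fragments can simply be dropped. The function name `clean_multi_value` is hypothetical, and the sample string only reuses the values Grain, seeds, wool mentioned in another issue.

```python
def clean_multi_value(raw, separator=","):
    """Split a multi-value cell, trim whitespace, and drop empty or dots-only parts."""
    values = []
    for part in raw.split(separator):
        value = part.strip()
        if not value or set(value) <= {".", "…"}:
            continue  # skip blanks and "..." remnants
        if value not in values:
            values.append(value)  # keep first occurrence, preserve order
    return values


cleaned = clean_multi_value("Grain, seeds, wool,...")
print(cleaned)
print(" | ".join(cleaned))  # pipe-separated form for a single term
```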

establishmentMeans

establishmentMeans will be based on V/I and:

  • Mapped to Blackburn et al.
  • Multiple values will be provided in the single establishmentMeans term, but separated by pipe.
  • This information will only be added for distribution records for Belgium (not Flanders, etc.)
  • Remove field establishmentMeans from spreadsheet

Importing and exporting to UTF-8

There seems to be a problem with the data encoding when importing the taxon dataset from, and exporting it to, a .csv file.

With respect to the importing:

A warning message emerges:

Warning message: In Sys.setlocale("LC_ALL", "en_US.UTF-8") : OS reports request to set locale to "en_US.UTF-8" cannot be honored

This warning message only occurs on Windows and can be resolved by running:

Sys.setlocale("LC_ALL", "English_United States.1252")

see this table of locales and the following GitHub issue

I already adapted this in the script

With respect to the exporting:

The taxon data is not exported as UTF-8, despite the fileEncoding = "UTF-8" argument in the write.csv statement

E.g. in line 188 of the taxon dataset, the special character č in the scientificName Pastinaca sativa L. subsp. urens (Req. ex Godr.) čelak is converted to c.

I suspect this problem is somehow due to a printing bug in R, despite the fact that the data is correctly read and stored. Apparently, the print() method on data.frame tries to round-trip characters through the active encoding, which is lossy when converting UTF-8 encoded characters (see this GitHub issue). However, this is just an idea; I have tried several other options, without result.

@qgroom do you have any idea what the problem is here?
 

Don't use parsed scientific name columns or classification

GBIF will parse the scientificName and link to the backbone taxonomy. We don't want to maintain our own parsed scientific name columns or classifications, so those newly added columns should be removed:

  • Taxon: keep
  • Family: keep, as Filip will keep maintaining that one
  • Genus: remove
  • specificEpithet: remove
  • authority: remove
  • kingdom: remove (is used by GBIF, but it's all Plantae, so easy to add in mapping)
  • phylum: remove
  • order: remove
  • infraspecificEpithet: remove
  • scientificNameAuthorship: remove
  • nomenclaturalCode: remove (all ICBN, so do this in mapping)
  • taxonRank: keep, as this is useful for filtering data + also required by GBIF

Don't use hybrid as taxonRank

GBIF will indicate those with rank unknown. That makes sense, you could argue that Berberis x hybrido-gagnepainii J.V. Suringar should have the rank species. I'd therefore suggest to use the correct taxonRank in the Excel file.

We already have a column hybridFormula which should also make it possible to find the 125 hybrids in the data. For 15 of the 125 it's not populated yet, but once we have that you can filter on hybrids again.

GBIF also automatically detects hybridFormula names: https://www.gbif.org/species/search?dataset_key=9ff7d317-609b-4c08-bd86-3bc404b77c42&name_type=HYBRID&issue=RANK_INVALID&advanced=1

To be coordinated with Filip

the use of extinct as a degreeOfEstablishment/invasionStage is confusing

We have based degreeOfEstablishment/invasionStage on Blackburn et al 2011, which does not have a term for extinct and I don't think it should. We have placed the term extinct under occurrenceStatus as a subcategory of absent.
Currently, the term extinct is being used in the description extension in the Manual under the invasionStage as one element that links to the taxon. This is problematic because this one entry can cover two entries in the distribution extension: one for the period that the taxon was present and one for when it was absent (extinct).

Add degree of establishment

Add degree of establishment to the description extension, with the values:

  • established
  • introduced
  • data deficient (which includes to be determined by experts)

I don't remember on which field in the Excel this was based, but it should be in the Google Doc by Tim. I also don't know the name of the used vocabulary (if any).

Vocab for native range

There is a vocab for native range suggested in the Excel file. I have some questions and suggestions about it and would like to have it confirmed:

original value | original suggested value | new suggested value | comment
AM | pan-American | Americas | For the region, Wikipedia suggests "Americas"
NAM | Northern America | Northern America | Do we mean Northern America (US & CA) or North America (includes Mexico and Latin America)??
SAM | Southern America | South America | Is redirected on Wikipedia to South America, so would use that name.
AF | Africa | Africa |
E | Europe | Europe |
AUS | Australasia | Australasia | Has Wikipedia page
AS | Asia | Asia |
AS-Tr | tropical Asia | Tropical Asia | Has Wikipedia page and is written with capital letter in sentences.
AS-Te | temperate Asia | Temperate Asia | Does not have Wikipedia page, just a category, where it is mentioned as a biogeographical region, with capital letters.
Trop. | Pantropical | pantropical | Is written lowercase on Wikipedia
Cult. | cultivated origin | cultivated | I don't think there is a need to indicate "origin" in native range = cultivated
Hybr. | Hybrid origin | hybridization | I think this works better as native range = hybridization
? | unknown | | For the moment, any ? value is left blank and not included in the description extension

D/N information should apply to year range

As discussed with @Groom, the degree of naturalization (D/N) should apply to the year range defined by FR (first record) and MRR (most recent record).

Currently it is confusing if a casual taxon with an MRR of e.g. 1945 should be labelled extinct or casual. The best way is to indicate what the status was of that taxon for that time period:

1944    1957    casual

That also means that none of the taxa should be labelled as extinct.

This information should be updated.

Should we add info for Flanders, Brussels, Wallonia in distribution

Information regarding pathway, first record and most recent record all apply to the distribution in Belgium. Extrapolating that information to Flanders, Wallonia and Brussels is somewhat odd. The checklist does contain minimal presence information regarding those regions ( X or ?).

So the question is, do we publish distribution information about the regions:

locality | eventDate | occurrenceStatus | establishmentMeans
Belgium | 1945-2017 | casual | seed_contaminant
Flanders | implied? | implied? | implied?

Or not:

locality | eventDate | occurrenceStatus | establishmentMeans
Belgium | 1945-2017 | casual | seed_contaminant

Add column hybridFormula

The column Taxon sometimes contains the hybrid formula between parenthesis.

Taxon: Chenopodium x preissmannii J. Murr (C. album L. x opulifolium)

This will make it difficult for the GBIF name parsing, so it's better to move this information to a separate column:

Taxon: Chenopodium x preissmannii J. Murr
HybridFormula: C. album L. x opulifolium

For hybrid without a name, both columns should be populated with the same info:

Taxon: Chenopodium album L. x hircinum
HybridFormula: Chenopodium album L. x hircinum

Ideally, the HybridFormula should also be more complete and consistent, like Chenopodium album L. x Chenopodium opulifolium L.

Note: the hybridFormula information will not be mapped to published data.
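The split described above could be sketched in Python as follows. This is a heuristic under stated assumptions: the formula is the final parenthesised part containing " x ", and a name with a standalone x after its second word is treated as an unnamed hybrid formula. The function name `split_hybrid_formula` is hypothetical, and the multiplication sign × is not handled.

```python
import re


def split_hybrid_formula(taxon):
    """Return (taxon, hybrid_formula) for a raw Taxon value; formula may be None."""
    name = taxon.strip()
    # Case 1: named hybrid followed by its formula in parentheses
    match = re.fullmatch(r"(.+?)\s*\(([^()]*\sx\s[^()]*)\)", name)
    if match:
        return match.group(1), match.group(2)
    # Case 2: unnamed hybrid, where "x" appears after the first two words
    if "x" in name.split()[2:]:
        return name, name
    # Case 3: no hybrid formula found
    return name, None


print(split_hybrid_formula("Chenopodium x preissmannii J. Murr (C. album L. x opulifolium)"))
print(split_hybrid_formula("Chenopodium album L. x hircinum"))
```

Authorship in parentheses, as in (Req. ex Godr.), is left alone because it contains no standalone x.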

Add taxonRank dropdown

taxonRank is a required term. The suggested vocabulary is this one, but it does not contain hybrids. We need this information for preparsing some of the names (in case the GBIF parsing fails).

Suggestion for terms:

genus
species
subspecies
variety
cultivar
hybrid             own term
hybridFormula      own term

Full names for hybrid formulas

Chenopodium album L. x hircinum can be interpreted as:

  1. A named subspecies hybrid of Chenopodium album
  2. A hybrid formula: Chenopodium album L. x Chenopodium hircinum L.

If the latter, I would spell out the name in full. It's best to do this for all hybrid formulas.

fields FL, Br, WA

Hi there, as you know the idea is also to use the checklists to feed regional biodiversity indicators. Therefore, these fields are important to us to include. For a number of the other checklists, we can infer them: the non-native fish are for Flanders only, the macroinvertebrates as well so these are all Fl: present. Inferring this from the occurrences will not work, as many species figure on checklists based on published information rather than occurrences. Is it still possible to include this @peterdesmet ?

occurrenceStatus

Should be extant (= present) for all taxa, as extinct will be removed (see #9). As this is a very simple mapping, the column occurrenceStatus should be removed from the spreadsheet.

Species with doubtful degree of establishment should not have doubtful occurrenceStatus

Copied from @qgroom trias-project/indicators#21 (comment)

Some of your doubtful species are certainly present, e.g. Pinus sylvestris, Narcissus pseudonarcissus, so perhaps it is doubtful that they are naturalized. For P. sylvestris the checklist says it is doubtful only for Brussels. He must mean it is doubtfully naturalized, because I'm 100% sure it is present.

Picea abies, Pinus rigida and Pinus pinaster are probably similar cases. They are planted, but it is doubtful if they are escaping.

For N. pseudonarcissus the checklist is only referring to Narcissus pseudonarcissus L. subsp. major, not N. pseudonarcissus in general. It is an important distinction, as N. pseudonarcissus is a native species, but this subspecies is not.

The rest are probably genuinely doubtful species. That is, they are doubtfully present.

@LienReyserhove can you ask Filip to change this from ? to X in the source data?

Empty invasion stage for Galatella sedifolia (L.) Greuter

Noted one record with an empty invasion stage:

"alien-plants-belgium:taxon:00bb2bc05d9481b0bc8b5b57117fa799","","invasion stage","en"

This is Galatella sedifolia (L.) Greuter, which is the only record with an empty D/N. I would not create a record for an empty D/N.

Set establishmentMeans to "introduced"

As long as TDWG hasn't adopted (and GBIF hasn't implemented) the use of a pathway vocabulary for establishmentMeans, I propose that we stick to the current one that is in use:

http://rs.gbif.org/vocabulary/gbif/establishment_means.xml

Which is:

native
introduced
- naturalised
- invasive
- managed
uncertain

I would opt to use introduced (which has synonyms exotic and alien) for all distributions (Belgium + regions) even for presence uncertain, because we know all those species are alien. I wouldn't go further in stating if something is naturalized or invasive.

  1. That should solve the distribution error we currently get #35
  2. Make the data more useful for folks looking for introduced species
  3. Our pathway information is still published in the description extension, unconcatenated (which is even better)

Update headers in source file

  • Remove empty line 1
  • Update Presence to Presence_Fl, Presence_Br, Presence_Wa
  • Remove empty 3 and 4
  • Add column ID as first column

Stable IDs in the manual

Taxon IDs are currently stored in the Excel file. However, when new species are added (alphabetically), the ID column is reassigned by Filip, so all IDs run again from 1 to x. That means that almost all species get assigned new IDs.

Option 1: Explain to Filip that those IDs should be kept stable, but that requires manual addition of new IDs... which will eventually lead to duplicates or gaps.

Option 2: Generate an ID (hash) in the R script, based on the scientific name. If a name gets updated, it gets a new ID (which is logical, except for typographic corrections, but we can live with that). Has the advantage that records won't change in git every time the ID column gets recalculated.

I prefer option 2.

Have asked GBIF if taxonID is even used to assign taxonKeys in GBIF: gbif/portal-feedback#688
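Option 2 could be sketched as below in Python (the actual script is R). MD5 is an assumption: it yields 32 hexadecimal characters, which matches the length of the taxonIDs quoted in other issues, but neither the hash function nor the exact input string is stated here. The function name `make_taxon_id` and the prefix default are illustrative.

```python
import hashlib


def make_taxon_id(scientific_name, prefix="alien-plants-belgium:taxon:"):
    """Derive a stable ID from the scientific name (assumed: MD5 of the UTF-8 name)."""
    digest = hashlib.md5(scientific_name.strip().encode("utf-8")).hexdigest()
    return prefix + digest


first = make_taxon_id("Panicum dichotomiflorum Michaux")
second = make_taxon_id("Panicum dichotomiflorum Michaux")
print(first)
print(first == second)  # same name, same ID, regardless of row order
```

The ID depends only on the name, so inserting new species alphabetically leaves existing IDs untouched, while any change to the name string (including a typographic correction) produces a new ID.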

origin

Valid values for this dataset are:

  • vagrant for D/N = Cas.
  • introduced for D/N = Nat. or Inv.

As this is a simple mapping, the column origin should be removed from the spreadsheet

Add invasion stage

Suggestion: add invasion stage to the description extension. The mapping would look like this (occurrenceStatus and establishmentMeans in distribution extension):

raw data (D/N) | invasion stage | occurrenceStatus | establishmentMeans
Cas. | Casual | present | introduced
Cas.? | Casual | present | introduced
Ext / Cas. | Casual | present | introduced
Ext | | absent | introduced
Ext? | | absent | introduced
Inv. | established | present | introduced
Nat? | established | present | introduced
Nat. | established | present | introduced
Nat.? | established | present | introduced
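The table above translates directly into a lookup, sketched here in Python for illustration. The names `DN_MAPPING` and `map_degree_of_naturalisation` are hypothetical; the values are copied from the table, with None standing for the empty invasion stage of the Ext rows.

```python
# D/N value -> (invasion stage, occurrenceStatus, establishmentMeans), as in the table
DN_MAPPING = {
    "Cas.": ("Casual", "present", "introduced"),
    "Cas.?": ("Casual", "present", "introduced"),
    "Ext / Cas.": ("Casual", "present", "introduced"),
    "Ext": (None, "absent", "introduced"),
    "Ext?": (None, "absent", "introduced"),
    "Inv.": ("established", "present", "introduced"),
    "Nat?": ("established", "present", "introduced"),
    "Nat.": ("established", "present", "introduced"),
    "Nat.?": ("established", "present", "introduced"),
}


def map_degree_of_naturalisation(value):
    """Look up a raw D/N value; raise ValueError for values not in the table."""
    key = value.strip()
    if key not in DN_MAPPING:
        raise ValueError(f"Unmapped D/N value: {value!r}")
    stage, status, means = DN_MAPPING[key]
    return {
        "invasionStage": stage,
        "occurrenceStatus": status,
        "establishmentMeans": means,
    }


print(map_degree_of_naturalisation("Nat.?"))
print(map_degree_of_naturalisation("Ext"))
```

Raising on unmapped values, rather than passing them through, makes new or misspelled D/N values in the source file visible at mapping time.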

Background:

In the latest TrIAS core group meeting, we decided to integrate invasion stage in the description extension of the Manual of Alien Plants. As for the Registry of Alien Macroinvertebrates in Flanders, Belgium and the Checklist of Non-native Freshwater Fishes in Flanders, Belgium, we use the invasion stage vocabulary listed in Blackburn et al. 2011.

To integrate invasion stage in the description extension, we need to make some decisions:

Discard origin information:

The information needed for invasion stage is contained in D/N (degree of naturalisation) in the raw data file (see table for the terms). This information can perfectly be mapped to the Blackburn et al. (2011) vocabulary. This field is currently used for the mapping of origin in the description extension (see #12). I would discard the term origin and replace it by invasion stage

Interpretation of D/N values:

  1. For the previous checklists, we used the invasion stages casual and established. We decided to discard the terms naturalized or invasive listed in Blackburn et al. (see trias-project/alien-fishes-checklist#6 (comment)). So, naturalized and invasive in D/N are replaced by established. casual in D/N, of course, remains casual.

  2. D/N = Extinct has no alternative in Blackburn et al. (2011). For this, I would leave invasion stage empty and supplement the information with information in occurrenceStatus and establishmentMeans (see table on top). See occurrenceStatus GBIF vocabulary: "An extinct organism is absent while its establishmentMeans is native". Of course, as we are working with invasive species, establishmentMeans would be introduced in this case.

  3. We "ignore" the question marks in 'D/N' (see #12 (comment))

How to express invasionStage for plants?

@qgroom @SoVDH @timadriaens what invasion stage information do we want to extract from the manual of alien plants?

This is the information we have:

invasion stage

  • D/N is information you have already mapped (filter on "plants")
  • presence_value comes from the presence columns Fl, Wa, and Br, which can have x for present and ? for unknown (I assume).
  1. In its simplest form, we only express that a taxon is:
       • present = x for that region
       • unknown = ? for that region

  2. But we could also take some of the information from the D/N column, to make more nuanced statements, such as established, introduced, to be determined, unknown. Note that even for values in D/N without a question mark, we do have question marks in the presence_value column (and vice versa), so we need to decide which one takes preference.

  3. And finally, a lot of taxa have empty presences for regions. We could either express this as absent from that region, or take the more cautious approach (maybe there was no monitoring for that region) and not say anything about the occurrence of that species in that region.

Could you give some input regarding this, so we know how to basically map the above table to occurrenceStatus (that's the field to use right @qgroom)?

Add genus, specificEpithet and infraspecificEpithet

Could be generated from the scientificName with the GBIF name parser. We could then choose one of the following approaches:

Add as 3 columns to source file

Bulk of it could be populated once with GBIF name parser and then maintained manually.

Main advantage: speed + control

Add as 3 columns in mapping

Would have to be parsed every time (slowing down the publication)

Main advantage: no maintenance

Add speciesProfile information

@qgroom can you ask Filip to add a column to the source file called environment, with the following 3 values:

  • M for isMarine
  • F for isFreshwater
  • T for isTerrestrial

The column can contain multiple values, such as F/T, separated by /.

We will use this information to populate the http://rs.gbif.org/extension/gbif/1.0/speciesprofile.xml


Note: we will also indicate FALSE if something is not indicated, so T will have:

  • isMarine: false
  • isFreshwater: false
  • isTerrestrial: true
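A minimal Python sketch of this mapping follows. The function name `species_profile` is hypothetical; the codes M, F, T, the / separator and the three boolean terms come from the description above. Rejecting unknown codes is a design choice of the sketch, not something agreed in the issue.

```python
def species_profile(environment):
    """Translate an environment value such as 'F/T' into speciesProfile booleans."""
    codes = {code.strip().upper() for code in environment.split("/") if code.strip()}
    unknown = codes - {"M", "F", "T"}
    if unknown:
        raise ValueError(f"Unknown environment code(s): {sorted(unknown)}")
    return {
        "isMarine": "M" in codes,
        "isFreshwater": "F" in codes,
        "isTerrestrial": "T" in codes,
    }


print(species_profile("T"))
print(species_profile("F/T"))
```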
