trias-project / rinse-pathways-checklist

🚢 RINSE - Pathways and vectors of biological invasions in Northwest Europe

Home Page: https://trias-project.github.io/rinse-pathways-checklist

License: MIT License

dataset checklist r gbif rstats oscibio invasive-species

rinse-pathways-checklist's Introduction

RINSE - Pathways and vectors of biological invasions in Northwest Europe

Rationale

This repository contains the functionality to standardize the data of Zieritz et al. (2017) to a Darwin Core checklist that can be harvested by GBIF. It was developed for the TrIAS project.

Workflow

source data (transcribed from the original Supplementary Table 2 Word file) → Darwin Core mapping script → generated Darwin Core files

Published datasets

Repo structure

The repository structure is based on Cookiecutter Data Science and the Checklist recipe. Files and directories indicated with GENERATED should not be edited manually.

├── README.md              : Description of this repository
├── LICENSE                : Repository license
├── rinse-pathways-checklist.Rproj : RStudio project file
├── .gitignore             : Files and directories to be ignored by git
│
├── data
│   ├── raw                : Source data, input for mapping script
│   └── processed          : Darwin Core output of mapping script GENERATED
│
├── docs                   : Repository website GENERATED
│
└── src
    ├── dwc_mapping.Rmd    : Darwin Core mapping script, core functionality of this repository
    ├── _site.yml          : Settings to build website in docs/
    └── index.Rmd          : Template for website homepage

Installation

1. Clone this repository to your computer
2. Open the RStudio project file
3. Open the dwc_mapping.Rmd R Markdown file in RStudio
4. Install any required packages
5. Click Run > Run All to generate the processed data
6. Alternatively, click Build > Build website to generate the processed data and build the website in docs/

Contributors

List of contributors

License

MIT License

rinse-pathways-checklist's Issues

Wrong dates in biological invasions paper

The dates for records of Rhododendron ponticum Linnaeus in the Zieritz et al. (2016) dataset need cleaning: the date of observation for GB is 17634, and for Belgium it is 19204. I assume the latter should be 1904, as this is the date of observation in the Netherlands. For GB, however, I have no clue, so I will contact the authors and use 1763 for now.
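A quick way to flag such values is to check for anything that is not a four-digit year. This is only a sketch: the `first_record` vector is a made-up stand-in holding the three values mentioned above, not the real column from the dataset.

```r
# Hypothetical stand-in for the date column, using the values from this issue
first_record <- c(GB = "17634", BE = "19204", NL = "1904")

# A plausible year consists of exactly four digits
is_valid_year <- grepl("^[0-9]{4}$", first_record)

# Values that need manual checking
suspect <- first_record[!is_valid_year]
print(suspect)
```

This only detects the malformed values; deciding on the correct year still requires the source or the authors.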

Incorrect CBD pathway mapping

While working on the trias package for checklist pathway indicators, I found some irregularities in pathway standardization (see trias-project/indicators#61).

All pathways in RINSE are at level 1 (e.g. cbd_2014_pathway:escape, cbd_2014_pathway:corridor, cbd_2014_pathway:release), except one: cbd_2014_pathway:natural_dispersal. natural_dispersal is a level 2 pathway (strictly, it should be unaided_natural_dispersal). I propose changing it to cbd_2014_pathway:unaided. It occurs in the pathway information of 73 taxa.

Examples:

I added a patch while preprocessing data from unified to indicators, but it would be better to solve it at checklist publication level, wouldn't it? Thanks.
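A minimal sketch of the proposed change, assuming the pathway values sit in a character vector. The `pathway` vector below is made-up example data, not the checklist itself, and where this step would go in the mapping script is not decided here.

```r
# Hypothetical example values
pathway <- c(
  "cbd_2014_pathway:escape",
  "cbd_2014_pathway:natural_dispersal",
  "cbd_2014_pathway:release"
)

# Replace the level 2 term with the proposed level 1 term (exact match only)
pathway_fixed <- sub(
  "^cbd_2014_pathway:natural_dispersal$",
  "cbd_2014_pathway:unaided",
  pathway
)
print(pathway_fixed)
```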

How to map `pathway` information to CBD standard?

In the checklist, the following columns describe the pathway of introduction:

pathway:
- import_release
- import_escape
- import_dispersal (i.e. merging categories ‘corridor’ and ‘unaided’)
- import_accidental (i.e. merging categories ‘contaminant’ and ‘stowaway’)
vector:
- ornamental (e.g. horticulture)
- leisure (e.g. hunting, recreational angling)
- industry (e.g. agriculture, aquaculture, fur farming)
- biocontrol
- research

Thus, 4 x 5 = 20 pathway x vector combinations occur in the raw data.

With respect to the pathway columns, these can easily be mapped to the CBD standard. The difficult thing is to interpret the vector information, which is not easily matched with the CBD standard. An example:

raw dataset	direct match CBD	options for mapping to CBD
import_release: biocontrol	yes	release_biological_control
import_escape: leisure	no	release_fishery, release_hunting
import_release: industry	no	release_landscape_improvement, release_other
import_dispersal: ornamental	yes	corridor
import_dispersal: ornamental	yes	unaided

For the pathway information I think there's no problem. For the vector information, we have several options:

1. We do not map the vector information
2. We do not map the vector information when we don't have a clear match to the CBD standard
3. We map all vector information by attempting to match it to all possible terms of the CBD standard
4. We map vector information as given in the raw data (using the 5 categories), which will not always match the CBD standard.

Imo, the last option would be the most correct and the easiest to do. However, in that case, we deviate from our own TrIAS vocabulary, which follows the CBD standard. If this is a problem, I would prefer option 2.
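Option 2 could be sketched as a lookup that returns a CBD term only for pathway/vector pairs with a direct match, and NA otherwise. The lookup below is deliberately incomplete: it holds just the biocontrol example from the table above, and the names `direct_match` and `map_option_2` are hypothetical.

```r
# Pathway/vector pairs with a clear CBD match (illustrative, not exhaustive)
direct_match <- c(
  "import_release:biocontrol" = "release_biological_control"
)

# Return the CBD term for a direct match, NA when there is no clear match
map_option_2 <- function(pathway, vector) {
  key <- paste(pathway, vector, sep = ":")
  unname(direct_match[key])
}

mapped <- map_option_2(
  pathway = c("import_release", "import_escape"),
  vector  = c("biocontrol", "leisure")
)
print(mapped)
```

Pairs without a clear match stay NA, so nothing is guessed for cases such as import_escape with leisure.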

Use gather on pathway only

For mapping pathway and vector, it is easier to gather only the pathways into rows (as we will need each one of these) and remove NA, while leaving the vectors as columns.

Code suggestion (from Exploratory):

# 1. Select columns I want (this step is not required)
select(species, pathway_import_release, pathway_import_escape, pathway_accidental, pathway_dispersal, vector_ornamental, vector_leisure, vector_industry, vector_biocontrol, vector_research) %>%

# 2. Transform "Y" values to "T" and make them logicals.
# This step is a bit verbose (and not required), but makes the mapping (step 6) a bit more readable
mutate(
  pathway_import_release = parse_logical(recode(pathway_import_release, "Y" = "T")),
  pathway_import_escape = parse_logical(recode(pathway_import_escape, "Y" = "T")),
  pathway_accidental = parse_logical(recode(pathway_accidental, "Y" = "T")),
  pathway_dispersal = parse_logical(recode(pathway_dispersal, "Y" = "T")),
  vector_ornamental = parse_logical(recode(vector_ornamental, "Y" = "T")),
  vector_leisure = parse_logical(recode(vector_leisure, "Y" = "T")),
  vector_industry = parse_logical(recode(vector_industry, "Y" = "T")),
  vector_biocontrol = parse_logical(recode(vector_biocontrol, "Y" = "T")),
  vector_research = parse_logical(recode(vector_research, "Y" = "T"))
) %>%

# 3. Gather pathway, remove NA
gather(pathway, value, starts_with("pathway_"), na.rm = TRUE, convert = TRUE) %>%

# 4. Column "value" will only contain "TRUE" (or "Y" if you skip step 2), so no need for this column
select(-value) %>%

# 5. Arrange by species to see things more in context (not required)
arrange(species) %>%

# 6. Mapping itself (7 instead of 11 steps). Maybe include an _else_ at the bottom
mutate(CBD = case_when(
    pathway == "pathway_accidental" ~ "stowaway,contaminant",
    pathway == "pathway_dispersal" ~ "corridor,natural_dispersal",
    pathway == "pathway_import_escape" & vector_leisure ~ "escape_food_bait",
    pathway == "pathway_import_escape" & vector_research ~ "escape_research",
    pathway == "pathway_import_escape" ~ "escape",
    pathway == "pathway_import_release" & vector_biocontrol ~ "release_biocontrol",
    pathway == "pathway_import_release" ~ "release"
)) %>%

# 7. Separating "stowaway,contaminant", ... into two columns
separate(CBD, into = c("CBD_1", "CBD_2"), sep = "\\s*\\,\\s*", remove = TRUE, convert = TRUE) %>%

# 8. Gather 2 columns into 2 rows
gather(key, value, starts_with("cbd_"), na.rm = TRUE, convert = TRUE) %>%

# 9. Sort to show context per species
arrange(species)

Update hybrid taxonRank

In response to gbif/portal-feedback#1354 (comment), change taxonRank from hybrid to species for:

Fallopia X bohemica (Chrtek & Chrtková) Bailey
Euphorbia X pseudovirgata (Schur)
Aster X salignus Willd.

That should solve rank unknown issues.
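A minimal sketch of the change, assuming a taxon table with the Darwin Core columns scientificName and taxonRank. The `taxa` data frame is made-up example data (the three hybrids above plus one ordinary species), not the actual taxon core.

```r
# Hypothetical example rows
taxa <- data.frame(
  scientificName = c(
    "Fallopia X bohemica (Chrtek & Chrtková) Bailey",
    "Euphorbia X pseudovirgata (Schur)",
    "Aster X salignus Willd.",
    "Rhododendron ponticum Linnaeus"
  ),
  taxonRank = c("hybrid", "hybrid", "hybrid", "species"),
  stringsAsFactors = FALSE
)

# Set the rank of all hybrids to species
taxa$taxonRank[taxa$taxonRank == "hybrid"] <- "species"
print(taxa)
```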

Cleaning steps references

Some feedback needed:

The Zieritz et al. (2016) checklist has a reference column containing numbers. Two things with respect to that:

The numbers are separated by commas and hyphens. A hyphen indicates a sequence, i.e. 1-4 refers to references 1, 2, 3 and 4, and we need the expanded form. I haven't yet figured out how to generate these sequences in a way that keeps the code readable. Thus, I suggest generating the sequences in the raw data file rather than performing the cleaning in the R script (which would make it messier). As this is a dead dataset, I think this cleaning step won't do any harm.
For some species, about 12 reference numbers are provided, which is a lot. Just to be sure: is it really necessary to include the full references? The fields will be full of text, but I guess there's no way around that, right?
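For the first point, if the expansion were done in R after all, one possible sketch is a small helper. It assumes each cell is a non-missing string that uses plain commas and ASCII hyphens. The function name `expand_refs` and the example input are hypothetical.

```r
# Expand a reference string such as "1-4, 7" into "1,2,3,4,7"
expand_refs <- function(x) {
  # Split on commas and trim surrounding whitespace
  parts <- trimws(strsplit(x, ",")[[1]])
  nums <- unlist(lapply(parts, function(p) {
    bounds <- as.integer(strsplit(p, "-")[[1]])
    # Two bounds indicate a range, a single value stays as it is
    if (length(bounds) == 2) seq(bounds[1], bounds[2]) else bounds
  }))
  paste(nums, collapse = ",")
}

expanded <- expand_refs("1-4, 7, 10-12")
print(expanded)
```

To apply it to a whole column, something like vapply(refs, expand_refs, character(1)) would work, with `refs` standing in for the reference column.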

Some names don't have spaces before (

This can be solved by adding this step BEFORE generating taxon IDs:

# add space before every (, then remove double spaces
mutate(species = str_replace_all(species, "\\(", " ("), species = str_replace_all(species, "  ", " "))

Add license + institutionCode

I think a decision has been taken on how to populate these fields, but it is not included in the first mapping.

Native range = "Ar"

Two taxa have a native range = "Ar", i.e. Coregonus nasus and Salvelinus alpinus.
I didn't find a legend for the abbreviations of native ranges. As five native ranges are discussed in the results section of the article, while there are six abbreviations, I suspect Ar is a typo.
However, which native range would it represent then? Africa? America? Asia?

NA environment maps to different species profiles

I noticed in https://trias-project.github.io/rinse-pathways-checklist/dwc_mapping.html#6_create_species_profile_extension

raw_environment	isMarine	isFreshwater	isTerrestrial	records
NA	FALSE	FALSE	TRUE	1
NA	TRUE	TRUE	FALSE	1

Are empty environments mapped differently? Or is that because of the Rattus and Salvelinus?

Change rightsHolder and institutionCode

We decided the following for all TrIAS source checklists:

publisher = institutionCode = rightsHolder = the org that had or was granted the permission to publish the data under the license it has.

Should be integrated accordingly
See this issue
