rinse-pathways-checklist's Introduction

RINSE - Pathways and vectors of biological invasions in Northwest Europe

Rationale

This repository contains the functionality to standardize the data of Zieritz et al. (2017) to a Darwin Core checklist that can be harvested by GBIF. It was developed for the TrIAS project.

Workflow

source data (transcribed from the original Supplementary Table 2 Word file) → Darwin Core mapping script → generated Darwin Core files

Published datasets

Repo structure

The repository structure is based on Cookiecutter Data Science and the Checklist recipe. Files and directories indicated with GENERATED should not be edited manually.

├── README.md              : Description of this repository
├── LICENSE                : Repository license
├── rinse-pathways-checklist.Rproj : RStudio project file
├── .gitignore             : Files and directories to be ignored by git
│
├── data
│   ├── raw                : Source data, input for mapping script
│   └── processed          : Darwin Core output of mapping script GENERATED
│
├── docs                   : Repository website GENERATED
│
└── src
    ├── dwc_mapping.Rmd    : Darwin Core mapping script, core functionality of this repository
    ├── _site.yml          : Settings to build website in docs/
    └── index.Rmd          : Template for website homepage

Installation

  1. Clone this repository to your computer
  2. Open the RStudio project file
  3. Open the dwc_mapping.Rmd R Markdown file in RStudio
  4. Install any required packages
  5. Click Run > Run All to generate the processed data
  6. Alternatively, click Build > Build website to generate the processed data and build the website in docs/

Contributors

List of contributors

License

MIT License

rinse-pathways-checklist's People

Contributors

lienreyserhove, peterdesmet

rinse-pathways-checklist's Issues

Wrong dates in biological invasions paper

The dates for records of Rhododendron ponticum Linnaeus in the Zieritz et al. (2016) dataset need cleaning: for GB the date of first observation is given as 17634, and for Belgium as 19204. I assume the latter should be 1904, as this is the date of first observation in the Netherlands. For GB, however, I have no clue. I will contact the authors about this and use 1763 for now.
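
One way to catch such values before mapping is to flag every year that is not exactly four digits. A minimal base R sketch, using only the three values mentioned above; the vector name `first_record` and its country labels are hypothetical and not taken from the dataset:

```r
# Year values as reported in this issue (GB, Belgium, the Netherlands)
first_record <- c(GB = "17634", BE = "19204", NL = "1904")

# TRUE for values that are not a plain four-digit year
needs_cleaning <- !grepl("^[0-9]{4}$", first_record)

# Labels of the records that need a manual check
names(first_record)[needs_cleaning]
```

This only flags the values; deciding on the correct year still requires checking with the authors.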

Incorrect CBD pathway mapping

While working on the trias package for checklist pathway indicators, I found some irregularities in pathway standardization (see trias-project/indicators#61).

All pathways in RINSE are at level 1 (e.g. cbd_2014_pathway:escape, cbd_2014_pathway:corridor, cbd_2014_pathway:release), with one exception: cbd_2014_pathway:natural_dispersal. natural_dispersal is a level 2 pathway (actually it should be unaided_natural_dispersal). I propose to change it to cbd_2014_pathway:unaided. It occurs in the pathway information of 73 taxa.

Examples:

I added a patch while preprocessing data from unified to indicators, but it would be better to solve this at the checklist publication level, wouldn't it? Thanks.
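
The proposed change could be applied in the mapping script along these lines. This is a sketch in base R on a made-up vector named `pathway`; how the pathway values are actually stored in the mapping script is an assumption here:

```r
# Example pathway values; only the last one needs to change
pathway <- c(
  "cbd_2014_pathway:escape",
  "cbd_2014_pathway:corridor",
  "cbd_2014_pathway:natural_dispersal"
)

# fixed = TRUE treats the pattern as a literal string, not a regular expression
pathway <- sub(
  "cbd_2014_pathway:natural_dispersal",
  "cbd_2014_pathway:unaided",
  pathway,
  fixed = TRUE
)
```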

How to map `pathway` information to CBD standard?

In the checklist, the following columns describe the pathway of introduction:

  • pathway:

    • import_release
    • import_escape
    • import_dispersal (i.e. merging categories ‘corridor’ and ‘unaided’)
    • import_accidental (i.e. merging categories ‘contaminant’ and ‘stowaway’)
  • vector:

    • ornamental (e.g. horticulture)
    • leisure (e.g. hunting, recreational angling)
    • industry (e.g. agriculture, aquaculture, fur farming)
    • biocontrol
    • research

Thus, 4 x 5 = 20 pathway x vector combinations occur in the raw data
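
The count can be checked by enumerating every pairing of the categories listed above. A small base R sketch; the object names are chosen for illustration only:

```r
pathways <- c("import_release", "import_escape", "import_dispersal", "import_accidental")
vectors <- c("ornamental", "leisure", "industry", "biocontrol", "research")

# One row per pathway x vector pairing
combinations <- expand.grid(
  pathway = pathways,
  vector = vectors,
  stringsAsFactors = FALSE
)

nrow(combinations)
```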

With respect to the pathway columns, these can easily be mapped to the CBD standard. The difficult thing is to interpret the vector information, which is not easily matched with the CBD standard. An example:

| raw dataset | direct match CBD | options for mapping to CBD |
| --- | --- | --- |
| import_release: biocontrol | yes | release_biological_control |
| import_escape: leisure | no | release_fishery, release_hunting |
| import_release: industry | no | release_landscape_improvement, release_other |
| import_dispersal: ornamental | yes | corridor |
| import_dispersal: ornamental | yes | unaided |

For the pathway information I think there's no problem. For the vector information, we have several options:

  1. We do not map the vector information
  2. We do not map the vector information when we don't have a clear match to the CBD standard
  3. We map all vector information by attempting to match it to all possible terms of the CBD standard
  4. We map vector information as given in the raw data (using the 5 categories), which will not always match the CBD standard.

Imo, the last option would be the most correct and the easiest to implement. However, in that case we deviate from our own TrIAS vocabulary, which follows the CBD standard. If this is a problem, I would prefer option 2.
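
To make option 2 concrete, here is a rough base R sketch: the vector is used only when a direct match to the CBD standard is known, otherwise the value falls back to the level 1 pathway. The function `map_to_cbd()` and both lookup vectors are hypothetical. The single direct match is the import_release: biocontrol example given above, and the level 1 labels for release and escape are assumptions for this illustration.

```r
# Pathway:vector pairs with a clear CBD match (extend as matches are agreed on)
direct_matches <- c("import_release:biocontrol" = "release_biological_control")

# Level 1 fallback when the vector has no clear match (assumed labels)
level_1 <- c(import_release = "release", import_escape = "escape")

map_to_cbd <- function(pathway, vector) {
  key <- paste(pathway, vector, sep = ":")
  ifelse(
    key %in% names(direct_matches),
    unname(direct_matches[key]),
    unname(level_1[pathway])
  )
}

map_to_cbd(
  pathway = c("import_release", "import_escape"),
  vector = c("biocontrol", "leisure")
)
```

Pathways that are missing from the fallback vector return NA, which makes unmapped cases easy to spot.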

Use gather on pathway only

For mapping pathway and vector, it is easier to gather only the pathway columns into rows (as we will need each one of these) and remove NA values, while leaving the vectors as columns.

Code suggestion (from Exploratory):

# 1. Select columns I want (this step is not required)
select(species, pathway_import_release, pathway_import_escape, pathway_accidental, pathway_dispersal, vector_ornamental, vector_leisure, vector_industry, vector_biocontrol, vector_research) %>%

# 2. Transform "Y" values to "T" and make them logicals.
# This step is a bit verbose (and not required), but makes the mapping (step 6) a bit more readable
mutate(
  pathway_import_release = parse_logical(recode(pathway_import_release, "Y" = "T")),
  pathway_import_escape = parse_logical(recode(pathway_import_escape, "Y" = "T")),
  pathway_accidental = parse_logical(recode(pathway_accidental, "Y" = "T")),
  pathway_dispersal = parse_logical(recode(pathway_dispersal, "Y" = "T")),
  vector_ornamental = parse_logical(recode(vector_ornamental, "Y" = "T")),
  vector_leisure = parse_logical(recode(vector_leisure, "Y" = "T")),
  vector_industry = parse_logical(recode(vector_industry, "Y" = "T")),
  vector_biocontrol = parse_logical(recode(vector_biocontrol, "Y" = "T")),
  vector_research = parse_logical(recode(vector_research, "Y" = "T"))
) %>%

# 3. Gather pathway, remove NA
gather(pathway, value, starts_with("pathway_"), na.rm = TRUE, convert = TRUE) %>%

# 4. Column "value" will only contain "TRUE" (or "Y" if you skip step 2), so no need for this column
select(-value) %>%

# 5. Arrange by species to see things more in context (not required)
arrange(species) %>%

# 6. Mapping itself (7 instead of 11 steps). Maybe include an _else_ at the bottom
mutate(CBD = case_when(
    pathway == "pathway_accidental" ~ "stowaway,contaminant",
    pathway == "pathway_dispersal" ~ "corridor,natural_dispersal",
    pathway == "pathway_import_escape" & vector_leisure ~ "escape_food_bait",
    pathway == "pathway_import_escape" & vector_research ~ "escape_research",
    pathway == "pathway_import_escape" ~ "escape",
    pathway == "pathway_import_release" & vector_biocontrol ~ "release_biocontrol",
    pathway == "pathway_import_release" ~ "release"
)) %>%

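# 7. Separate "stowaway,contaminant", ... into two columns
# fill = "right" puts NA in CBD_2 for single values such as "escape", without a warning
separate(CBD, into = c("CBD_1", "CBD_2"), sep = "\\s*\\,\\s*", remove = TRUE, convert = TRUE, fill = "right") %>%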

# 8. Gather the 2 CBD columns into rows (match the column names as created in step 7)
gather(key, value, starts_with("CBD_"), na.rm = TRUE, convert = TRUE) %>%

# 9. Sort to show context per species
arrange(species)

Cleaning steps references

Some feedback needed:

The Zieritz et al. (2016) checklist has a reference column containing numbers. Two things with respect to that:

  1. The numbers are separated by commas and hyphens. The hyphen is used to indicate a sequence, i.e. 1-4 refers to references 1, 2, 3 and 4. We need the latter. I haven't yet figured out how to generate these sequences in a way that keeps the code readable. Thus, I suggest generating the sequences in the raw data file, rather than performing the cleaning in the R script (which makes it messier). As this is a dead dataset, I think the cleaning step won't do any harm.

  2. For some species, about 12 reference numbers are provided, which is a lot. Just to be sure, is it really necessary to integrate the full reference? The fields will be full of text, but I guess there's no way around that, right?
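
If the sequences were generated in R after all, a small helper could keep the script readable. This base R sketch assumes the reference field is a single string of integers separated by commas, with plain ASCII hyphens for ranges; the function name `expand_references()` is hypothetical:

```r
expand_references <- function(x) {
  # Split "1-4, 7" into "1-4" and "7"
  parts <- trimws(strsplit(x, ",")[[1]])
  numbers <- lapply(parts, function(part) {
    bounds <- as.integer(strsplit(part, "-")[[1]])
    # A range has two bounds; a single reference has one
    if (length(bounds) == 2) seq(bounds[1], bounds[2]) else bounds
  })
  unlist(numbers)
}

expand_references("1-4, 7, 10-12")
```

The helper works on one string at a time, so it would need to be applied per row (for example with `lapply()`) to process the whole column.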

Some names don't have spaces before (

Can be solved by adding this step BEFORE generating taxon IDs:

# add space before every (, then remove double spaces
mutate(species = str_replace_all(species, "\\(", " ("), species = str_replace_all(species, "  ", " "))
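
A possible alternative is to insert the space only where it is missing, so no double spaces are created in the first place. This base R sketch uses a negative lookbehind; the function name and the example names are made up for illustration:

```r
add_space_before_parenthesis <- function(x) {
  # (?<!\\s) is a negative lookbehind: match "(" only if no whitespace precedes it
  gsub("(?<!\\s)\\(", " (", x, perl = TRUE)
}

species <- c("Genus species(Author, 1900)", "Genus species (Author, 1900)")
add_space_before_parenthesis(species)
```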

Native range = "Ar"

Two taxa have a native range = "Ar", i.e. Coregonus nasus and Salvelinus alpinus.
I didn't find a legend for the abbreviations of native ranges. As five native ranges are discussed in the results section of the article, and there are six abbreviations, I suspect Ar is a typo.
However, which native range would it represent then? Africa? America? Asia?

Change rightsHolder and institutionCode

We decided the following for all TrIAS source checklists:

publisher = institutionCode = rightsHolder = the org that had or was granted the permission to publish the data under the license it has.

Should be integrated accordingly
See this issue
