trias-project / alien-species-checklist

🐞 Proof of concept for a checklist of alien species in Belgium

License: MIT License

Jupyter Notebook 95.73% Python 0.44% R 3.83%

alien-species-checklist's Issues

Create concatenated file

This can only be done once all the other issues have been resolved.

Match concatenated list to GBIF

Are all records in Annex B included in Neobiota?

If yes, we can ignore the Annex B file in our combined checklist. @stijnvanhoey will do a join on scientificNames to see how many match.
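
A minimal sketch of such a join on scientificName, in plain Python. The function name, the whitespace/case normalisation, and the sample names are assumptions for illustration, not the project's actual method or data.

```python
def normalise(name):
    """Collapse repeated whitespace and ignore case so trivial differences still match."""
    return " ".join(name.split()).casefold()


def match_names(annex_b_names, neobiota_names):
    """Split the Annex B names into those found and not found in the Neobiota list."""
    known = {normalise(name) for name in neobiota_names}
    matched = [name for name in annex_b_names if normalise(name) in known]
    missing = [name for name in annex_b_names if normalise(name) not in known]
    return matched, missing


# Hypothetical sample names, for illustration only
annex_b = ["Elodea canadensis", "Rudbeckia  laciniata", "Aster americ."]
neobiota = ["elodea canadensis", "Rudbeckia laciniata", "Mimulus guttatus"]

matched, missing = match_names(annex_b, neobiota)
print(len(matched), "of", len(annex_b), "Annex B names occur in Neobiota")
print("Not found:", missing)
```

If `missing` comes back empty for the real files, the Annex B file adds no names beyond Neobiota.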

GRIIS mapping

We will add the following columns for species that occur in GRIIS and the concatenated file (based on identical acceptedKey):

PresenceBE: one of our 7 columns
Status: standardized
Source: dataset name from which this info was derived
GBIF acceptedKey
Our taxonID ?

Order of sources to use info from:

Plants
Wrims
Fishes
Macroinvertebrates (all considered established)
Harmonia (beware, not all present in BE)
Rinse + annex: no status info

Matches

Records in both lists: considered validated.
Records in GRIIS only: probably errors in one of the two datasets.
Records in concatenated only: we will submit these as an additional file of records potentially missing from GRIIS.
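
The comparison and the source priority described above can be sketched as follows. The function names and the example keys are hypothetical; the sketch only assumes that both lists carry a GBIF acceptedKey and that each concatenated record has a datasetName.

```python
# Order of sources to take status info from, as listed in this issue
SOURCE_ORDER = ["plants", "wrims", "fishes", "macroinvertebrates",
                "harmonia", "rinse", "rinse-annex-b"]


def compare_by_accepted_key(griis_keys, concatenated_keys):
    """Sort acceptedKeys into the three match groups."""
    griis, ours = set(griis_keys), set(concatenated_keys)
    return {
        "validated": sorted(griis & ours),          # in both lists
        "griis_only": sorted(griis - ours),         # probably an error somewhere
        "concatenated_only": sorted(ours - griis),  # potentially missing from GRIIS
    }


def preferred_record(records):
    """Pick the record whose datasetName comes first in SOURCE_ORDER."""
    return min(records, key=lambda record: SOURCE_ORDER.index(record["datasetName"]))


# Hypothetical keys and records, for illustration only
groups = compare_by_accepted_key([101, 102, 103], [102, 103, 104])
best = preferred_record([
    {"datasetName": "rinse", "status": ""},
    {"datasetName": "harmonia", "status": "established"},
])
print(groups)
print(best["datasetName"])
```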

Populate scientificName

Create a scientificName for all records in concatenated where there currently isn't one. Since none of those have a subspecies, this can be done quite easily:

Filter on blank scientificName
Edit cells > Transform
value: cells["genus"].value + " " + cells["specificEpithet"].value
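
For reference, a rough Python equivalent of the Refine transform above, assuming each record is a dictionary with the columns scientificName, genus and specificEpithet. The function name and sample rows are illustrative only.

```python
def populate_scientific_name(row):
    """Fill a blank scientificName with 'genus specificEpithet'; leave others untouched."""
    if not (row.get("scientificName") or "").strip():
        row["scientificName"] = f'{row["genus"]} {row["specificEpithet"]}'
    return row


# Hypothetical sample rows, for illustration only
rows = [
    {"scientificName": "", "genus": "Elodea", "specificEpithet": "canadensis"},
    {"scientificName": "Rudbeckia laciniata", "genus": "Rudbeckia", "specificEpithet": "laciniata"},
]
rows = [populate_scientific_name(row) for row in rows]
print([row["scientificName"] for row in rows])
```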

Make sure to escape quotes before importing the file, so we don't get this in taxonID:

plants "Populus x jackii Sargent ""Gileadensis"" (P. balsamifera x deltoides)" Salicaceae Nat.? Hort. D NAM X 2010 N?
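
One possible way to handle such names, sketched with Python's standard csv module: on writing, embedded quotes are doubled and the field is wrapped in quotes, and a quote-aware reader restores the original value. This is an illustration of the escaping convention, not necessarily the step the project used; the three-column row is a shortened, hypothetical version of the record above.

```python
import csv
import io

name = 'Populus x jackii Sargent "Gileadensis" (P. balsamifera x deltoides)'

# Write a tab-separated row; fields containing the quote character get quoted
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
writer.writerow(["plants", name, "Salicaceae"])
encoded = buffer.getvalue()

# Read it back with a quote-aware parser: the doubled quotes collapse again
decoded = next(csv.reader(io.StringIO(encoded), delimiter="\t"))

print(encoded)
print(decoded[1] == name)
```

A parser that is not told about the quoting convention would keep the outer quotes and the doubled inner quotes as literal characters, which is the kind of value shown above.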

What filter to use for WRIMS?

In WRIMS, one can search on distributions: http://www.marinespecies.org/introduced/

There are 4 (sub)regions that might be applicable for a Belgian checklist:

Belgian coast (Coast): 13 records
Belgian Exclusive Economic Zone (EEZ): 55 records
Belgian part of the North Sea (Marine Region): 55 records
Belgium (Nation): 114 records

I verified that Belgium (Nation) completely includes the 3 other regions, so a search on Belgium (Nation) is the one we should use.

Create list of all "presence"

Once we have #26:

Do a facet on presenceBE
Copy all the values and the number of times they occur
Paste this information in the relevant sheet in this spreadsheet
In the spreadsheet, indicate the field name in the column Field
Repeat this for the 6 other presence columns.
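
Outside Refine, the same facet counts can be produced with a short script. This is a sketch assuming the records are dictionaries keyed by column name; the sample values are hypothetical.

```python
from collections import Counter


def facet(rows, column):
    """Count how often each value occurs in a column, like a Refine text facet."""
    return Counter((row.get(column) or "").strip() for row in rows)


# Hypothetical sample values, for illustration only
rows = [
    {"presenceBE": "X"},
    {"presenceBE": "X"},
    {"presenceBE": ""},
    {"presenceBE": "present"},
]

counts = facet(rows, "presenceBE")
for value, count in counts.most_common():
    print(repr(value), count)
```

The same function can be called once per presence column to fill the spreadsheet.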

Transform WRIMS data to flat CSV

Create one flat file (source format unknown)
Transform to CSV
Put CSV in https://github.com/LifeWatchINBO/alien-species-checklist/tree/master/source-datasets/wrims

Standardize to common terms: macroinvertebrates

Using this file:

Rename the columns we want to keep to the common term names
Remove the columns we are not planning to use
Add the other common terms as columns
Populate datasetName with macroinvertebrates
Order the columns alphabetically
Export the final file to the appropriate directory as data-with-common-terms.tsv
Store the refine steps as data-with-common-terms-refine.json
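
The renaming, dropping, adding and ordering steps above can be sketched in Python as follows. The column mapping, the list of common terms and the sample row are assumptions for illustration; the real mapping lives in the Refine JSON, which this sketch does not reproduce.

```python
import csv
import io


def standardize(rows, rename, common_terms, dataset_name):
    """Keep only renamed columns, add missing common terms, order columns alphabetically."""
    columns = sorted(set(common_terms) | set(rename.values()) | {"datasetName"},
                     key=str.casefold)
    standardized = []
    for row in rows:
        new_row = dict.fromkeys(columns, "")      # every common term starts empty
        for old_name, term in rename.items():     # columns not in `rename` are dropped
            new_row[term] = row.get(old_name, "")
        new_row["datasetName"] = dataset_name
        standardized.append(new_row)
    return columns, standardized


def to_tsv(columns, rows):
    """Serialise the rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical source columns and mapping, for illustration only
source_rows = [{"Species": "Dikerogammarus villosus", "Family": "Gammaridae", "Notes": "x"}]
rename = {"Species": "scientificName", "Family": "family"}
common_terms = ["datasetName", "family", "scientificName", "status"]

columns, rows = standardize(source_rows, rename, common_terms, "macroinvertebrates")
tsv_text = to_tsv(columns, rows)
print(tsv_text)
```

Writing `tsv_text` to data-with-common-terms.tsv would complete the export step.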

quick clean google refine

First scan & basic cleaning of the initial files

Belgian Alien Species Checklist

Dear Tim
I would like your help defining the main column titles (Darwin Core terms) we need for the checklist.

I will try to work through to assignment #30. Peter, you can check the spreadsheet for comments.
I don't think I'm technically qualified for the assignments after that, right?
If I have to do them, I need to be at INBO with one of you (Stijn, Peter or Dimitri).
That would be possible by next Monday. Or has Stijn already completed them?
Either way, I would like to know how they were done.

Thanks

Transform Harmonia data to flat CSV

Create one flat file (source format unknown)
Transform to CSV
Put CSV in https://github.com/LifeWatchINBO/alien-species-checklist/tree/master/source-datasets/harmonia

Standardize to common terms: plants

Using this file:

Rename the columns we want to keep to the common term names
Remove the columns we are not planning to use
Add the other common terms as columns
Populate datasetName with plants
Order the columns alphabetically
Export the final file to the appropriate directory as data-with-common-terms.tsv
Store the refine steps as data-with-common-terms-refine.json

Create list of all "introductionPathways"

Once we have #26:

Do a facet on introductionPathways
Copy all the values and the number of times they occur
Paste this information in the relevant sheet in this spreadsheet

GBIF match results for concatenated file

This table gives an overview of the current GBIF name matching status:

match	fishes	harmonia	macroinvertebrates	plants	rinse	rinse-annex-b	wrims	sum
exact match with gbifapi_scientificName			1	1294	36	1		1332
exact match with gbifapi_canonicalName	22	130	66	3	6093	21	175	6510
EXACT 100%		4		849	211	6		1070
EXACT < 100%				72	20	4		96
FUZZY	1		2	15	73	4		95
HIGHERRANK		3	4	177	224	7	25	440
NO OR DOUBLE MATCH		3			4	2		9
sum	23	140	73	2410	6661	45	200	9552

Some observations:

The first 3 categories can be considered OK, which is 93.3% of the dataset! The only caveat is that we have to trust the accepted names GBIF gives for synonyms (745 records + 15 doubtful), which we don't always do: e.g. Tripolium pannonicum is not a synonym of A. salignus.
The 96 EXACT < 100% matches will have to be examined case by case.
The 95 FUZZY matches are probably typos, and are addressed in #41
The 440 HIGHERRANK matches are mostly plants and a lot of hybrids. Chances are we can only correct half of those to match in GBIF.
And then there are 9 records with no match: those are some viruses and names in Harmonia that appear twice in GBIF (a bug that will be fixed in April).
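
The share of records considered OK follows directly from the sum column of the table; a small check in Python:

```python
# Values copied from the "sum" column of the table above
sums = {
    "exact match with gbifapi_scientificName": 1332,
    "exact match with gbifapi_canonicalName": 6510,
    "EXACT 100%": 1070,
    "EXACT < 100%": 96,
    "FUZZY": 95,
    "HIGHERRANK": 440,
    "NO OR DOUBLE MATCH": 9,
}

ok_categories = [
    "exact match with gbifapi_scientificName",
    "exact match with gbifapi_canonicalName",
    "EXACT 100%",
]

total = sum(sums.values())
ok = sum(sums[category] for category in ok_categories)
share_ok = round(100 * ok / total, 1)

print(f"{ok} of {total} records ({share_ok}%) are in the first 3 categories")
```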

@timadriaens, how do you want to prioritize going forward?

GRIIS dataset: absence of a taxonID

In order to match the GRIIS dataset with GBIF, some transformation is needed. However, @timadriaens, is there any taxonID in the current set to use as an identifier?

Add presence to WRIMS dataset

We want to add presence data to the WRIMS dataset:

Open this file in Open Refine: https://github.com/LifeWatchINBO/alien-species-checklist/blob/master/source-datasets/wrims/data.tsv
Do a facet on PlaceName
Filter on Belgium and add a column based on this column, named presenceBE
Filter on Belgian part of the North Sea and add a column based on this column, named presenceBPNS
Filter on Belgian Exclusive Economic Zone and add a column based on this column, named presenceBEEZ
Filter on Belgian Coast and add a column based on this column, named presenceBECoast
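
The four filter-and-add-column steps can be sketched as one function. It assumes the PlaceName values are spelled exactly as in the steps above and that each row describes one taxon in one place; the function name, the taxonID column and the sample rows are hypothetical.

```python
# PlaceName value -> name of the presence column derived from it
PRESENCE_COLUMNS = {
    "Belgium": "presenceBE",
    "Belgian part of the North Sea": "presenceBPNS",
    "Belgian Exclusive Economic Zone": "presenceBEEZ",
    "Belgian Coast": "presenceBECoast",
}


def add_presence_columns(row):
    """Copy PlaceName into the matching presence column and leave the others blank."""
    place = (row.get("PlaceName") or "").strip()
    for place_name, column in PRESENCE_COLUMNS.items():
        row[column] = place if place == place_name else ""
    return row


# Hypothetical sample rows, for illustration only
rows = [
    {"taxonID": "1", "PlaceName": "Belgium"},
    {"taxonID": "2", "PlaceName": "Belgian Coast"},
    {"taxonID": "3", "PlaceName": "North Sea"},
]
rows = [add_presence_columns(row) for row in rows]
print(rows[0]["presenceBE"], rows[1]["presenceBECoast"])
```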

Once this is done:

Copy and apply https://github.com/LifeWatchINBO/alien-species-checklist/blob/master/source-datasets/wrims/data-with-common-terms-refine.json
Move the columns presenceBE, presenceBEEZ, presenceBPNS and presenceBECoast to the correct position (see https://github.com/LifeWatchINBO/alien-species-checklist#process)
Export the data as data-with-common-terms.tsv
Export the Refine as data-with-common-terms-refine.json

concatenated tsv

When I open the concatenated file in Refine, I still have the same problem: only 1256 records rather than the 9000 or more expected.

Remove decimals from years

Because of a Refine issue in automatically interpreting data from Excel, years are written as 2015.0. We need to remove those decimals.

Plants
WRIMS
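
A minimal sketch of the cleanup in Python, assuming years appear as four digits followed by `.0`; any other value (blank, text, ranges) is returned unchanged. The function name is illustrative.

```python
import re

# Four digits followed by a decimal point and only zeros, e.g. "2015.0"
YEAR_WITH_DECIMAL = re.compile(r"^(\d{4})\.0+$")


def clean_year(value):
    """Strip the trailing '.0' that the Excel import added to years."""
    text = str(value).strip()
    match = YEAR_WITH_DECIMAL.match(text)
    return match.group(1) if match else text


for raw in ["2015.0", "1998", "", "ca. 1900"]:
    print(repr(raw), "->", repr(clean_year(raw)))
```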

For vocabs add source

Transform RINSE data to flat CSV

Combine Excel sheets into one flat file
Transform Excel file to CSV
Put CSV in https://github.com/LifeWatchINBO/alien-species-checklist/tree/master/source-datasets/rinse

First summary of GBIF match of waarnemingen.be species

matchType	confidence	status	records
EXACT	100	ACCEPTED	20021
EXACT	100	SYNONYM	1390
EXACT	<100	ACCEPTED	1775
EXACT	<100	DOUBTFUL	34
EXACT	<100	SYNONYM	73
FUZZY	<100	ACCEPTED	129
FUZZY	<100	SYNONYM	21
HIGHERRANK	100	ACCEPTED	398
HIGHERRANK	100	DOUBTFUL	2
HIGHERRANK	100	SYNONYM	44
HIGHERRANK	<100	ACCEPTED	281
HIGHERRANK	<100	SYNONYM	22
blank			990

Get WRIMS data

Question is asked by @DimEvil to VLIZ. Ideally there is a public bulk dataset.

Return file format for GRIIS?

In what format should the GRIIS information be provided when returning it? Is CSV appropriate, or should it be Excel?

Waarnemingen.be mapping

We will add the following columns for species that occur in wn.be and the concatenated file (based on identical acceptedKey):

Status: standardized
GBIF acceptedKey
Our taxonID ?

Order of sources to use info from:

Plants
Wrims
Fishes
Macroinvertebrates (all considered established)
Harmonia (beware, not all present in BE)
Rinse + annex: no status info

Matches

Records in both lists ok: maybe check discrepancy in status wn.be and ours
Records in wn.be only: mostly natives, check those flagged by wn.be as exotic
Records in concatenated only: ok

Standardize to common terms: fishes

Using this file:

Rename the columns we want to keep to the common term names
Remove the columns we are not planning to use
Add the other common terms as columns
Populate datasetName with fishes
Order the columns alphabetically
Export the final file to the appropriate directory as data-with-common-terms.tsv
Store the refine steps as data-with-common-terms-refine.json

Create list of all "origin"

Once we have #26:

Do a facet on origin
Copy all the values and the number of times they occur
Paste this information in the relevant sheet in this spreadsheet

Standardize "introductionPathways"

Standardize to common terms: wrims

Using this file:

Rename the columns we want to keep to the common term names
Remove the columns we are not planning to use
Add the columns Belgian exclusive economic zone; Belgian Part of the North sea; Belgian coast
Add the other common terms as columns
Populate datasetName with wrims
Order the columns alphabetically
Export the final file to the appropriate directory as data-with-common-terms.tsv
Store the refine steps as data-with-common-terms-refine.json

alien_plants_meise

What is the meaning of the columns "D/N" and "V/I"?

Transform plants data to flat CSV

Combine Excel sheets into one flat file
Transform Excel file to CSV
Put CSV in https://github.com/LifeWatchINBO/alien-species-checklist/tree/master/source-datasets/plants

AnnexB RINSE Registry of NNS (Cleaning in refine)

Dear Tim,
Do we need to keep countries other than Belgium (i.e. GB, France, Netherlands)?

Standardize "status"

Which file to use for RINSE

@timadriaens, I also asked you this by email, but I'm recording it here so we won't forget. What file do we use for the RINSE dataset?

File emailed by you: https://github.com/LifeWatchINBO/alien-species-checklist/blob/master/source-datasets/rinse/AnnexB%20RINSE%20Registry%20of%20NNS.xlsx
File in supplementary material of Zieritz et al.: https://github.com/LifeWatchINBO/alien-species-checklist/blob/master/source-datasets/rinse/neobiota-023-065-s001.xlsx

My preference would be the second file (the supplementary material of Zieritz et al.), as that is publicly available and easier to reference. I just need to know that all the core info is there.

Create list of all "habitat"

Once we have #26:

Do a facet on habitat
Copy all the values and the number of times they occur
Paste this information in the relevant sheet in this spreadsheet

Get Harmonia data

Is the Harmonia dataset publicly available in bulk? If not, how can we obtain this?

Create list of all "status"

Once we have #26:

Do a facet on status
Copy all the values and the number of times they occur
Paste this information in the relevant sheet in this spreadsheet

GBIF match with fishes

23 records:

OK

20 have an exact match with gbifapi_canonicalName which is ACCEPTED.

Fuzzy

19 Proterorhinus semilunairs → Proterorhinus semilunaris Correct typo

Synonyms

0 Acipenser baeri → Acipenser baerii: Correct typo
1 Acipenser guldenstaedti → Acipenser gueldenstaedtii: Correct typo

For the macroinvertebrate data, we are currently relying on a final proof paper emailed by Tim: https://github.com/LifeWatchINBO/alien-species-checklist/blob/master/source-datasets/macroinvertebrates/AI15-039_Boets_etal_almost%20ready%207%20Dec%202015_Tim.doc

When Boets et al. is published, we should use the actual pdf paper as the source, as that is easier to reference.

Add acceptedKey to concatenated list

Transform macroinvertebrates data to flat CSV

Extract data from table in article
Transform to CSV
Put CSV in https://github.com/LifeWatchINBO/alien-species-checklist/tree/master/source-datasets/macroinvertebrates

Match waarnemingen.be species to GBIF

Input file is at https://github.com/LifeWatchINBO/alien-species-checklist/blob/master/reference-datasets/natuurpunt/waarnemingen-be-species.csv
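
A hedged sketch of matching a single name against GBIF. It assumes the public GBIF species match endpoint and its `usageKey` / `acceptedUsageKey` response fields; how the project actually derived its acceptedKey is not stated in these issues. The helper names and the sample response are hypothetical, and the sketch itself makes no network call.

```python
import json
import urllib.parse
import urllib.request

GBIF_MATCH_URL = "https://api.gbif.org/v1/species/match"


def build_match_url(scientific_name):
    """Build the request URL for one scientific name."""
    return GBIF_MATCH_URL + "?" + urllib.parse.urlencode({"name": scientific_name})


def fetch_match(scientific_name, timeout=30):
    """Query GBIF and return the parsed JSON response (requires network access)."""
    with urllib.request.urlopen(build_match_url(scientific_name), timeout=timeout) as response:
        return json.load(response)


def accepted_key(match):
    """Assumption: use the accepted taxon's key for synonyms, else the matched key."""
    return match.get("acceptedUsageKey") or match.get("usageKey")


# Hypothetical response, for illustration only (not fetched from GBIF)
sample = {"usageKey": 111, "acceptedUsageKey": 222, "matchType": "EXACT", "status": "SYNONYM"}

print(build_match_url("Elodea canadensis"))
print(accepted_key(sample))
```

Calling `fetch_match` once per row of the input file and storing matchType, confidence, status and the key would give a table like the waarnemingen.be summary.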

GBIF match with Harmonia

140 records:

Considered OK

119 records are exact match with gbifapi_canonicalName + ACCEPTED: 114 SPECIES and 5 GENUS. The genera got a lower confidence level, but are all correct.
2 records have EXACT 100% + ACCEPTED. Those are nothotaxa and are also correctly matched.

Synonyms

Other issues

74 Elodea canadensis → Multiple matches, hardcode to http://www.gbif.org/species/2865448 = http://mdoering.github.io/nub-browser/app/#/taxon/2865448
88 Hyacinthoides hispanica → Multiple matches, hardcode to http://www.gbif.org/species/5304257 = http://mdoering.github.io/nub-browser/app/#/taxon/5304257
143 Rudbeckia laciniata → Higher taxon: Rudbeckia. Multiple matches in fact, hardcode to http://www.gbif.org/species/3114229 = http://mdoering.github.io/nub-browser/app/#/taxon/3114229
107 Mimulus guttatus → Multiple matches, http://www.gbif.org/species/7333256 will disappear: http://mdoering.github.io/nub-browser/app/#/taxon/7333256. Update to Erythranthe guttata
36 Aster americ. → Higher taxon: Aster No variations can be found. Removed from list (i.e. name emptied)
127 Pinus nigra nigra → Higher taxon: Pinus nigra Can match to subsp, form or var: http://mdoering.github.io/nub-browser/app/#/search/Pinus%20nigra%20nigra Higher match considered ok

Standardize to common terms: harmonia

Using this file:

Rename the columns we want to keep to the common term names
Remove the columns we are not planning to use
Add the other common terms as columns
Populate datasetName with harmonia
Order the columns alphabetically
Export the final file to the appropriate directory as data-with-common-terms.tsv
Store the refine steps as data-with-common-terms-refine.json

Standardize to common terms: rinse-annex-b

Using this file:

Rename the columns we want to keep to the common term names
Remove the columns we are not planning to use
Add the other common terms as columns
Populate datasetName with rinse-annex-b
Order the columns alphabetically
Export the final file to the appropriate directory as data-with-common-terms.tsv
Store the refine steps as data-with-common-terms-refine.json

Standardize to common terms: rinse

Using this file:

Rename the columns we want to keep to the common term names
Remove the columns we are not planning to use
Add the other common terms as columns
Populate datasetName with rinse
Order the columns alphabetically
Export the final file to the appropriate directory as data-with-common-terms.tsv
Store the refine steps as data-with-common-terms-refine.json

Correct typos in data (i.e. fuzzy matches)

Filter the concatenated.tsv file on the fuzzy matches
Compare the original scientificName with the scientificName_gbif
Double-check the names (eol.org, Google...)
Correct typos!

(17 fuzzy matches)
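
To prepare the review, the fuzzy matches can be listed side by side with a similarity score. This sketch assumes the rows carry a matchType column next to scientificName and scientificName_gbif; the function names are illustrative and the sample uses the typo from the fishes issue. The final decision still needs a manual check.

```python
import difflib


def fuzzy_pairs(rows):
    """Return (original, GBIF) name pairs for rows flagged as FUZZY that actually differ."""
    return [
        (row["scientificName"], row["scientificName_gbif"])
        for row in rows
        if row.get("matchType") == "FUZZY"
        and row["scientificName"] != row["scientificName_gbif"]
    ]


def similarity(original, suggestion):
    """Similarity ratio between 0 and 1; values near 1 suggest a simple typo."""
    return difflib.SequenceMatcher(None, original, suggestion).ratio()


rows = [
    {"scientificName": "Proterorhinus semilunairs",
     "scientificName_gbif": "Proterorhinus semilunaris", "matchType": "FUZZY"},
    {"scientificName": "Elodea canadensis",
     "scientificName_gbif": "Elodea canadensis", "matchType": "EXACT"},
]

pairs = fuzzy_pairs(rows)
for original, suggestion in pairs:
    print(f"{original} -> {suggestion} ({similarity(original, suggestion):.2f})")
```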
