usfws / fwspp

Query species occurrence observations on USFWS properties
License: Creative Commons Zero v1.0 Universal
Apparently, FWS has a mirror (?) of the NPS taxonomy.
Base URL: https://ecos.fws.gov/ServCatServices/v2/rest/taxonomy/searchByScientificName/
It should be a drop-in replacement for nps_taxonomy, so this may simply need rebranding and documentation updates. Presumably this applies to nps_taxonomy_by_code as well?
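If the ServCat endpoint really does mirror the NPS service, the swap could start with a URL builder like this sketch (the accepted parameters and response format are assumptions that need verifying against the live service):

```r
# Build a searchByScientificName request URL for the ServCat taxonomy service.
# Assumption: the endpoint takes a URL-encoded scientific name appended to the
# base path; the shape of the JSON it returns is unverified.
fws_taxonomy_url <- function(sci_name) {
  base <- "https://ecos.fws.gov/ServCatServices/v2/rest/taxonomy/searchByScientificName/"
  paste0(base, utils::URLencode(sci_name, reserved = TRUE))
}

# Example: spaces are percent-encoded
fws_taxonomy_url("Euthamia graminifolia")
```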
The FWSpecies database won't permit abundance values unless a species is "Present". Options are to upgrade the occurrence value OR move abundance values to the Abundance Notes field.
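A minimal sketch of the second option, assuming hypothetical column names occurrence, abundance, and abundance_notes:

```r
# Move abundance into the notes field whenever the occurrence value is not
# "Present", so the record passes the FWSpecies rule. Column names here are
# illustrative, not the package's actual schema.
move_abundance <- function(df) {
  not_present <- !is.na(df$abundance) & df$occurrence != "Present"
  df$abundance_notes[not_present] <- paste("Reported abundance:",
                                           df$abundance[not_present])
  df$abundance[not_present] <- NA
  df
}
```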
A record needs a Nativeness value to be approved. Set all those with missing values to "unknown" or equivalent (see tags in #22)...
The add_taxonomy function is returning the error "Taxonomy retrieval failed with the following error: at least one vector element is required". The error may be the result of the left_join in the nested join_taxonomy function.
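That message is typical of code handed a zero-length input, so a guard on an empty taxonomy table before the join may be the fix. A hypothetical sketch (safe_join_taxonomy, the sci_name key, and the base-R merge are illustrative, not the package's actual internals):

```r
# Guard the taxonomy join against an empty taxonomy table so the failure is a
# warning rather than a cryptic low-level error.
safe_join_taxonomy <- function(occ, tax) {
  if (is.null(tax) || nrow(tax) == 0) {
    warning("No taxonomy records to join; returning occurrences unchanged.")
    return(occ)
  }
  merge(occ, tax, by = "sci_name", all.x = TRUE)  # base-R stand-in for left_join
}
```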
We've taken great pains to catch errors during the fws_occ run. Let's create a function to check or review a fws_occ run and report to the user which properties had issues. fwspp_combine will fail reasonably (but vaguely) when trying to combine fwspp objects with captured errors, but fwspp_review (and specifically xlsx_review) does not...
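A sketch of what such a review function might look like, assuming failed properties are stored in the fwspp list as captured error conditions (the function name is hypothetical):

```r
# Scan an fwspp object (a named list of per-property results) and report
# which properties hold captured errors instead of data.
check_fws_occ <- function(fwspp_obj) {
  failed <- vapply(fwspp_obj,
                   function(x) inherits(x, c("error", "simpleError")),
                   logical(1))
  if (any(failed))
    message("Properties with errors: ",
            paste(names(fwspp_obj)[failed], collapse = ", "))
  names(fwspp_obj)[failed]
}
```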
Need the cadastral data before anything else will work, so figure it out...
Once the review spreadsheet has been reviewed, and records updated/accepted/rejected, need functionality to:
Expect some changes to Seasonality (Occurrence Class), Origin (Nativeness), and Management tags in the FWSpecies application that will need to be accounted for, most noticeably in the xlsx_review_tags and add_review_validation functions, but possibly elsewhere (e.g., review_helpers.R, fwspp_review.R)...
I'm not sure we incorporate Seasonality (i.e., "Occurrence Class") tags yet, though we probably should... ditto for "Management"
Installation of fwspp errors out because CRAN removed the ecoengine package (https://cran.r-project.org/web/packages/ecoengine/index.html). There are archived versions of the package that can be installed (https://cran.r-project.org/src/contrib/Archive/ecoengine/).
The fws_occ function uses the public-facing ServCat API to extract taxonomic names for the given refuge using the unit code. The records are pulled from ServCat using the following constraints:
Example evidence links that are broken have the following prefixes:
Sachuest <- find_fws("Sachuest")
Sachuest_occ <- fws_occ(Sachuest)
causes the following error:
1 properties will be queried:
Sachuest Point NWR (R5)
Processing Sachuest Point NWR
Spherical geometry (s2) switched off
Linking to GEOS 3.9.3, GDAL 3.5.2, PROJ 8.2.1; sf_use_s2() is FALSE
Splitting property for more efficient queries.
Spherical geometry (s2) switched on
Server request timeout set to 3 seconds (x4 for GBIF).
Querying the Global Biodiversity Information Facility (GBIF)...
Retrieving 437 records.
Querying Integrated Digitized Biocollections (iDigBio)...
No records found.
Taxonomy retrieval failed with the following error:
at least one vector element is required
Skipping taxonomy. Please send the resulting fwspp object to the maintainer of the fwspp package. You may also try again later using fwspp::add_taxonomy.
Sachuest_occ
$SACHUEST POINT NATIONAL WILDLIFE REFUGE
<simpleError in if (nrow(idb_recs) > 0) idb_recs <- clean_iDigBio(idb_recs) else idb_recs <- NULL: argument is of length zero>
attr(,"class")
[1] "fwspp"
attr(,"boundary")
[1] "admin"
attr(,"scrubbing")
[1] "strict"
attr(,"buffer_km")
[1] 0
attr(,"query_dt")
[1] "2023-07-20 09:07:05 EDT"
After updating sf and s2 the following error is returned when running install_fws_cadastral:
USFWS Cadastral Database downloaded and installed successfully.
Spherical geometry (s2) switched off
Storing USFWS cadastral geodatabase in a more efficient format. This will take several additional minutes.
Processing USFWS Interest boundaries.
Error in scan(text = lst[[length(lst)]], quiet = TRUE) :
scan() expected 'a real', got 'ParseException:'
Error in (function (msg) : ParseException: Unknown WKB type 12
The legacy packages maptools, rgdal, and rgeos, underpinning this package
will retire shortly. Please refer to R-spatial evolution reports on
https://r-spatial.org/r/2023/05/15/evolution4.html for details.
This package is now running under evolution status 0
Retaining may be as easy as modifying clean_sci_name to retain trinomials, but it will also be necessary to make accommodating changes in several of the taxonomy functions. For example, may have to count words in the scientific name to know whether to retain species or subspecies records...
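The word-count idea could be as simple as this sketch (assumes clean_sci_name has already stripped authorities and similar noise from the name):

```r
# Count whitespace-separated words in a scientific name; three or more words
# (e.g., "Cornus amomum ssp. obliqua") indicates a trinomial.
n_name_words <- function(sci_name) {
  lengths(strsplit(trimws(sci_name), "\\s+"))
}
is_trinomial <- function(sci_name) n_name_words(sci_name) >= 3
```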
Currently based on GBIF but rgbif::occ_search only retrieves 300 records at a time due to the page size offered by the GBIF API.
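One way around the cap is plain offset paging. This sketch abstracts the service call behind a hypothetical fetch_page(start, limit) function standing in for repeated rgbif::occ_search calls with an advancing start offset:

```r
# Page through a capped API by advancing the start offset until a short or
# empty page comes back. fetch_page is any function(start, limit) returning a
# (possibly empty) vector or list of records.
get_all_pages <- function(fetch_page, page_size = 300) {
  out <- list()
  start <- 0
  repeat {
    page <- fetch_page(start = start, limit = page_size)
    if (length(page) == 0) break
    out <- c(out, list(page))
    if (length(page) < page_size) break  # final, short page
    start <- start + page_size
  }
  unlist(out)
}
```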
We'll use similar functionality by calling the National Park Service web service to retrieve taxon codes, which also contains useful ITIS info (e.g., valid Scientific Name, Common Name, TSN, rank, etc.).
URLs without the leading http/https are corrupted by ECOS, so retain them...
fwspp::prep_cadastral(fwspp::find_fws("tetlin"), "admin", T)
#> Error in CPL_geos_union(st_geometry(x), by_feature): Evaluation error: TopologyException: Input geom 0 is invalid: Hole lies outside shell at or near point -141.55936108399999 62.771209905000035 at -141.55936108399999 62.771209905000035.
This mainly affects GBIF and BISON queries. Is it generally true that the direct media URL is accessible from the general record URL (e.g., as in iNaturalist)? If so, it may be unnecessary to go fishing for direct media URLs (and faster, in the case of rgbif::occ_data)...
Many records return more than one link separated by a comma or semicolon.
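A possible cleanup step, sketched with base R (whether to keep only the first link or expand to one row per link is an open design choice):

```r
# Split multi-link evidence strings on commas or semicolons, tolerating
# surrounding whitespace; returns a list with one character vector per input.
split_links <- function(x) strsplit(x, "\\s*[,;]\\s*")
```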
Store as an attribute the datetime a query was initiated. Subsequently use this to pass to the get_* functions to, I suspect, profoundly reduce the size and increase the speed of query updates.
- lastInterpreted parameter
- datemodified in iDigBio API, but need to check further
- lastindexed is appropriate VertNet parameter?

fwspp::retrieve_taxonomy("Solidago graminifolia")
#> sci_name acc_sci_name
#> 1 Solidago graminifolia Euthamia graminifolia
#> com_name rank
#> 1 NA, flattop goldentop, flat-top goldentop, slender goldentop Species
#> category taxon_code tsn note
#> 1 Vascular Plant 140446 37352 <NA>
Appears to occur when no common name is found for the original taxon but common names are found for the accepted taxon. Probably a simple na.omit fix...
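The fix might look like this sketch (assumed input: a character vector of common names for one taxon, possibly containing NA):

```r
# Drop NA entries before collapsing common names into a single display
# string, so output like "NA, flattop goldentop, ..." can't occur.
collapse_com_names <- function(com_names) {
  com_names <- stats::na.omit(com_names)
  if (length(com_names) == 0) NA_character_
  else paste(com_names, collapse = ", ")
}
```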
Add function to:
Some properties are relatively small in actual area compared to the area subsumed by their convex hulls. Two very good examples are Great Thicket and Blackwater.
Maybe partition a MULTIPOLYGON into component polygons if the component polygon area is below some threshold of the MULTIPOLYGON bounding box area? For example, the corresponding percentages for Great Thicket and Blackwater are ~1.5% and 4%, respectively. Could possibly ignore this complication if the number of records was relatively small (< 500K maybe) or the absolute bounding box area was relatively small as well...

This split should occur prior to, and not affect, possible temporally-split queries by get_GBIF.
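The decision rule alone (leaving the sf geometry handling aside) might be sketched like this; the 5% fraction and the minimum bounding-box size are illustrative assumptions, not settled thresholds:

```r
# Decide whether a property's MULTIPOLYGON should be split for querying:
# split when the polygons cover only a small fraction of their bounding box,
# unless the bounding box itself is small enough not to matter.
should_split <- function(poly_area_km2, bbox_area_km2,
                         frac_threshold = 0.05, min_bbox_km2 = 1000) {
  bbox_area_km2 > min_bbox_km2 &&
    (poly_area_km2 / bbox_area_km2) < frac_threshold
}
```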
Alaska Maritime seems related to crossing the international date line, and thus may be best incorporated with the solution to #2.
Items without details spawned a 500 Server error.
- ALASKA MARITIME NATIONAL WILDLIFE REFUGE
- BAKER ISLAND NATIONAL WILDLIFE REFUGE
- BRETON NATIONAL WILDLIFE REFUGE (Request Entity Too Large: WKT too large)
- HOWLAND ISLAND NATIONAL WILDLIFE REFUGE
- IZEMBEK NATIONAL WILDLIFE REFUGE (Request Entity Too Large)
- JARVIS ISLAND NATIONAL WILDLIFE REFUGE
- JOHNSTON ATOLL NATIONAL WILDLIFE REFUGE
- KINGMAN REEF NATIONAL WILDLIFE REFUGE
- MARIANA ARC OF FIRE NATIONAL WILDLIFE REFUGE (Request-URI Too Long)
- MIDWAY ATOLL NATIONAL WILDLIFE REFUGE
- NAVASSA ISLAND NATIONAL WILDLIFE REFUGE (Request Entity Too Large)
- PALMYRA ATOLL NATIONAL WILDLIFE REFUGE
- SUSQUEHANNA NATIONAL WILDLIFE REFUGE (Request Entity Too Large)
- WAKE ATOLL NATIONAL WILDLIFE REFUGE
Way too convoluted error handling. The way it's set up, if some stage of manage_gets fails, that should be captured and the process broken, saved, and moved along. Thus, it seems that performing manage_gets safely will be adequate. Removal of purrr::safely from get_verb_N will affect several files in many locations, but worth it for clarity's sake...
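A minimal replacement for the per-function purrr::safely wrappers, applied once at the manage_gets level (run_safely is a hypothetical name; manage_gets is the package's):

```r
# Capture an error once at the top level: return either the result or the
# error condition itself, so the calling loop can record it and move on.
run_safely <- function(fun, ...) {
  tryCatch(fun(...), error = function(e) e)
}
```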
The workbook created by the xlsx_submission function now contains the "SpeciesListForImport" tab. Each row in this tab is associated with a unique taxon observation, and the data are formatted to be consistent with the FWSpecies bulk submission template.
Users are currently required to download the entire FWS cadastral dataset from ServCat. It would be more efficient to download specific refuge data using the REST API from ArcGIS Online.
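A hypothetical sketch of such a query; the service URL, layer id, and the ORGNAME field name are placeholders that would need to be confirmed against the actual ArcGIS Online service:

```r
# Build an ArcGIS REST query URL for a single refuge's boundary as GeoJSON.
# "EXAMPLE" and the ORGNAME field are assumptions, not the real service.
cadastral_query_url <- function(orgname,
    service = "https://services.arcgis.com/EXAMPLE/FeatureServer/0") {
  paste0(service, "/query?where=",
         utils::URLencode(sprintf("ORGNAME='%s'", orgname), reserved = TRUE),
         "&outFields=*&f=geojson")
}
```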
Add:
Update:
Most of the data associated with the http://bins.boldsystems.org link seem to be returning the wrong field in sci_name. For example, the observation that returned the link "http://bins.boldsystems.org/index.php/Public_RecordView?processid=UAMIC444-13" returns a sci_name value of "Bold:aaa2226".
All are associated with bio_repo="GBIF"
The package currently depends on sp and rgeos, which will be deprecated in October 2023. We need to update all spatial functions to sf.
The workbook generated by the xlsx_submission function now includes a "FWSpecies" tab that contains evidence links associated with taxa already in FWSpecies for the given refuge. These evidence links can be used to update current FWSpecies records.
From Sarah Shultz:
The database has a slightly picky rule that requires that a nativeness value be assigned if a record is approved. Here are a couple of options, let me know if either are agreeable:
- Set nativeness to "unknown" (where currently null) and then update in the future if/when someone has time to seek out this information
- Leave nativeness blank and set the record status to "in review" (but remember, then the records won't show up on the basic checklist)
In short, when processing FWSpecies reviews for submission with fwspp_submission, we need to give the user one of the above options, with the first above as the default.
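A sketch of the default (first) option, assuming a nativeness column in the submission data:

```r
# Fill missing or blank nativeness values with "unknown" before submission,
# satisfying the FWSpecies approval rule.
fill_nativeness <- function(df, default = "unknown") {
  blank <- is.na(df$nativeness) | !nzchar(trimws(df$nativeness))
  df$nativeness[blank] <- default
  df
}
```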
THIS ISSUE IS A WORK IN PROGRESS
Currently, the output fwspp object is a list of data.frames (one per property) that may have taxonomic information. EDIT: Require them to have taxonomy. Review is frustrating without it...
Need option to take this object and export to spreadsheet for review with the following columns:
- org_name (narrow column since it'll be superfluous, but necessary when re-importing)
- category
- taxon_code
- sci_name
- com_name
- occurrence (as evaluated when compared against existing FWSpecies records for property)
- nativeness (imported from existing FWSpecies [?] or blank for new records)
- accept_record (defaults to YES for all retained records)
- evidence (or ExternalLinks)
- note (notes will be useful for identifying records that may be corrected at this stage)

Should account for possibility that some will have taxonomy and some will not...
There are several hundred instances of GRBIO institutions with the same acronym. These are causing problems during cleaning. There's no obvious way to assign unique acronyms and still link to the occurrence data based on the institution code.
Conclusion: cut out grbio linkage and deal with it...
See, e.g., Erie NWR
#> org_name
#> 1 ERIE NATIONAL WILDLIFE REFUGE
#> 2 ERIE NATIONAL WILDLIFE REFUGE
#> 3 ERIE NATIONAL WILDLIFE REFUGE
#> sci_name lon lat loc_unc_m
#> 1 Symphyotrichum puniceum var. puniceum -80.00139 41.78611 14
#> 2 Symphyotrichum lanceolatum var. lanceolatum -79.95472 41.56987 NA
#> 3 Cornus amomum ssp. obliqua -79.98504 41.58978 14
#> year month day
#> 1 1994 9 7
#> 2 1969 9 27
#> 3 2005 8 4
#> evidence
#> 1 portal.idigbio.org/portal/records/bfb1c8bf-b196-48bd-a908-43a5c85cf51a
#> 2 portal.idigbio.org/portal/records/65130d25-9fc9-4171-92f7-836c02bac556
#> 3 portal.idigbio.org/portal/records/d7b5d622-3024-4b7d-a819-3cd2291b8094
#> bio_repo com_name rank category taxon_code
#> 1 iDigBio purplestem aster Variety Vascular Plant 295996
#> 2 iDigBio NA, white panicle aster Variety Vascular Plant 290904
#> 3 iDigBio silky dogwood Subspecies Vascular Plant 130553
#> tsn note
#> 1 566343 <NA>
#> 2 566832 <NA>
#> 3 27801 <NA>
This enhancement allows users to add a date so that only records edited or changed after that date will be returned.
Currently, ridigbio returns datecollected by default, which we do not recommend to be used in scientific research. When a data provider does not provide a full date in the Darwin Core eventDate field, this complete value or the missing parts (i.e., month and/or day) are randomly generated and thus may lack any real meaning. The generated dates are difficult to detect, as they are randomly distributed. We are currently working to modify our ingestion pipeline to avoid randomly generating dates. However, dates remain an issue across biodiversity aggregators and the solution is not clear (see GBIF for example).
Why does this matter for fwspp?
I found that datecollected is used by this repository as if it was a real value. This may lead to artificial dates being used to make management decisions!
How to use other fields:
We plan to update the ridigbio package to instead return "data.dwc:eventDate", "data.dwc:year", "data.dwc:month", and "data.dwc:day" - which are all text fields, rather than dates. These fields are not randomly generated, instead the values are directly from data providers therefore they may provide meaning in biological research. See current issue and pull request.
Since this package currently downloads "all" fields, I hoped this solution might be only related to your clean_iDigBio function and not to your get_iDigBio function. Sadly, all fields aren't returned when "all" fields are specified. Instead, you will need to specify what fields you need to download. From your code, I believe you all want scientificname, lat/lon, coordinate uncertainty, catalognumber, UUID, and date. To obtain these fields, this is how you would modify the download:
fields2get <- c("data.dwc:scientificName",
                "data.dwc:decimalLatitude",
                "data.dwc:decimalLongitude",
                "data.dwc:coordinateUncertaintyInMeters",
                "catalognumber",
                "uuid",
                "data.dwc:eventDate",
                "data.dwc:year",
                "data.dwc:month",
                "data.dwc:day")

idb_recs <- try_idb(type = "records", mq = FALSE, rq = rq, fields = fields2get,
                    max_items = 100000, limit = 0, offset = 0, sort = FALSE,
                    httr::config(timeout = timeout))
Additional modification to clean_iDigBio will also be needed since the date downloaded here will not be in date format - instead, all dates will be text strings. There are many ways to convert these to dates, for example, see gatoRs remove_duplicate function or ridigbio proposed solution here.
Hope this helps and please let me know if you have any questions or want more specific code suggestions.
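For the date conversion in clean_iDigBio, a base-R sketch (assumes eventDate values arrive as ISO-8601 text, possibly with a time component; messier provider strings would need more care):

```r
# Parse the leading YYYY-MM-DD portion of a text eventDate into a Date.
# Malformed or incomplete strings come back as NA rather than erroring.
parse_event_date <- function(x) {
  as.Date(substr(x, 1, 10), format = "%Y-%m-%d")
}
```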
The "SpeciesListForImport" tab in the workbook generated by the xlsx_submission function now excludes taxa that are already in FWSpecies. The function cross-references the taxon codes in the workbook with the data in FWSpecies using the FWSpecies API.
The workbook generated by the xlsx_submission function now includes an "ExternalLinks" tab that contains the extra evidence links for the taxa in the "SpeciesListForImport" tab (if the taxon had more than one evidence link).