
fwspp's People

Contributors

adamdsmith, mccrea-cobb, timothymfox


fwspp's Issues

Ensure compatibility with FWSpecies sanity checks...

  • The FWSpecies database won't permit abundance values unless a species is "Present". Options are to upgrade the occurrence value OR move abundance values to the Abundance Notes field.

  • A record needs a Nativeness value to be approved. Set all those with missing values to "unknown" or equivalent (see tags in #22)...

add_taxonomy function is returning an error

The add_taxonomy function is returning the error "Taxonomy retrieval failed with the following error:
at least one vector element is required"

The error may be the result of the left_join in the nested join_taxonomy function.
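A minimal sketch of a defensive guard for this case, assuming the failure happens when an empty vector of scientific names reaches the taxonomy lookup (the wrapper name below is hypothetical):

# Hypothetical guard: skip taxonomy retrieval when no scientific names
# survive filtering, instead of passing an empty vector to the lookup
add_taxonomy_guarded <- function(occ_data) {
  sci_names <- unique(occ_data$sci_name)
  if (length(sci_names) == 0) {
    warning("No scientific names to look up; returning input unchanged.")
    return(occ_data)
  }
  fwspp::add_taxonomy(occ_data)  # proceed with the normal retrieval
}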

New fxn to check and report problems from a `fws_occ` run

We've taken great pains to catch errors during a fws_occ run. Let's create a function to review a fws_occ run and report to the user which properties had issues.

fwspp_combine fails reasonably (if vaguely) when trying to combine fwspp objects with captured errors, but fwspp_review (and specifically xlsx_review) does not...
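A minimal sketch of what such a checker might look like, assuming failed properties are stored as captured condition objects in the fwspp list (as in the Sachuest example further down this page); the function name is hypothetical:

# Hypothetical checker: report which properties in a fwspp object
# captured an error during the fws_occ run
fwspp_check <- function(fwspp_obj) {
  failed <- vapply(fwspp_obj, inherits, logical(1), what = "condition")
  if (!any(failed)) {
    message("All properties completed without captured errors.")
    return(invisible(character(0)))
  }
  for (prop in names(fwspp_obj)[failed])
    message("Problem with ", prop, ": ", conditionMessage(fwspp_obj[[prop]]))
  invisible(names(fwspp_obj)[failed])
}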

Import reviewed spreadsheets; export for FWSpecies upload

Once the review spreadsheet has been reviewed, and records updated/accepted/rejected, we need functionality to (sketched after the list):

  • import spreadsheet into R
  • drop unaccepted records
  • update taxonomy (in case any taxon codes were added or changed)
  • output in the FWSpecies upload format? (NRPC suggests CSV will work as well)
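A minimal sketch of that pipeline; the file names are placeholders, and whether add_taxonomy accepts a plain data frame at this stage is an assumption:

library(readxl)
library(dplyr)

# 1. Import the reviewed spreadsheet into R
reviewed <- read_excel("refuge_review.xlsx")

# 2. Drop records the reviewer did not accept
accepted <- filter(reviewed, toupper(accept_record) == "YES")

# 3. Update taxonomy in case taxon codes were added or changed
#    (assumes add_taxonomy can run on the imported records)
accepted <- fwspp::add_taxonomy(accepted)

# 4. Export in a flat format for upload (NRPC suggests CSV will work)
write.csv(accepted, "refuge_fwspecies_upload.csv", row.names = FALSE)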

Proposed edits to the FWSpecies tags

Expect some changes to the Seasonality (Occurrence Class), Origin (Nativeness), and Management tags in the FWSpecies application that will need to be accounted for, most notably in the xlsx_review_tags and add_review_validation functions but possibly elsewhere (e.g., review_helpers.R, fwspp_review.R)...

I'm not sure we incorporate Seasonality (i.e., "Occurrence Class") tags yet, though we probably should... ditto for "Management"

UpdatedFWSpeciesTags_5-4-2018.docx

Update fws_occ function to pull occurrence data from ServCat

The fws_occ function uses the public-facing ServCat API to extract taxonomic names for a given refuge using its unit code. Records are pulled from ServCat using the following constraints (a query sketch follows the list):

  • The record includes the bounding box for the given refuge
  • The record contains no other bounding boxes associated with other properties
  • The record is one of the following: Book Chapter, Conference Proceeding, Conference Proceeding Paper, Geospatial Dataset, Journal Article, Published Report, Published Report Section, Published Report Series, Resource Brief, Tabular Dataset, or Unpublished Report
  • The record is associated with at least one digital file
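A rough sketch of how such a constrained query might be issued over ServCat's REST interface with httr; the endpoint path and body fields below are assumptions, not the documented API:

library(httr)

# Hypothetical ServCat search; the URL and parameter names are
# placeholders, not the documented ServCat API
servcat_search <- function(unit_code) {
  resp <- POST(
    "https://ecos.fws.gov/ServCatServices/servcat/v4/rest/AdvancedSearch/Composite",
    body = list(units = list(unit_code), mustHaveDigitalFiles = TRUE),
    encode = "json"
  )
  stop_for_status(resp)
  content(resp, as = "parsed")
}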

fws_occ gives error with smaller refuges

Sachuest <- find_fws("Sachuest")
Sachuest_occ <- fws_occ(Sachuest)

causes the following error:
1 properties will be queried:
Sachuest Point NWR (R5)

Processing Sachuest Point NWR
Spherical geometry (s2) switched off
Linking to GEOS 3.9.3, GDAL 3.5.2, PROJ 8.2.1; sf_use_s2() is FALSE
Splitting property for more efficient queries.
Spherical geometry (s2) switched on
Server request timeout set to 3 seconds (x4 for GBIF).
Querying the Global Biodiversity Information Facility (GBIF)...
Retrieving 437 records.
Querying Integrated Digitized Biocollections (iDigBio)...
No records found.
Taxonomy retrieval failed with the following error:
at least one vector element is required
Skipping taxonomy. Please send the resulting fwspp object to the maintainer of the
fwspp package. You may also try again later using fwspp::add_taxonomy.

Sachuest_occ
$SACHUEST POINT NATIONAL WILDLIFE REFUGE
<simpleError in if (nrow(idb_recs) > 0) idb_recs <- clean_iDigBio(idb_recs) else idb_recs <- NULL: argument is of length zero>

attr(,"class")
[1] "fwspp"
attr(,"boundary")
[1] "admin"
attr(,"scrubbing")
[1] "strict"
attr(,"buffer_km")
[1] 0
attr(,"query_dt")
[1] "2023-07-20 09:07:05 EDT"

sf (1.0-13) and s2 (1.1.4) update causing error in install_fws_cadastral function

After updating sf and s2, the following error is returned when running install_fws_cadastral:
USFWS Cadastral Database downloaded and installed successfully.
Spherical geometry (s2) switched off
Storing USFWS cadastral geodatabase in a more efficient format. This will take several additional
minutes.
Processing USFWS Interest boundaries.
Error in scan(text = lst[[length(lst)]], quiet = TRUE) :
scan() expected 'a real', got 'ParseException:'
Error in (function (msg) : ParseException: Unknown WKB type 12
The legacy packages maptools, rgdal, and rgeos, underpinning this package
will retire shortly. Please refer to R-spatial evolution reports on
https://r-spatial.org/r/2023/05/15/evolution4.html for details.
This package is now running under evolution status 0
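WKB type 12 is the ISO MultiSurface geometry type, which GEOS cannot parse; a possible workaround, sketched under the assumption that the offending layer is polygonal and can be cast on read (the layer name below is hypothetical):

library(sf)

# Cast MULTISURFACE geometries to MULTIPOLYGON so downstream
# GEOS-based operations can parse them; the layer name is an assumption
interest <- st_read("FWSCadastral.gdb", layer = "FWSInterest")
interest <- st_cast(interest, "MULTIPOLYGON")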

Consider retaining subspecies

Retaining subspecies may be as easy as modifying clean_sci_name to retain trinomials, but accommodating changes will also be necessary in several of the taxonomy functions. For example, we may have to count words in the scientific name to know whether to retain species or subspecies records (see the sketch below)...
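A minimal sketch of the word-count idea, assuming cleaned scientific names are whitespace-delimited:

# Count words in a scientific name to distinguish binomials (species)
# from trinomials (subspecies/varieties) when deciding what to retain
name_depth <- function(sci_name) {
  n <- lengths(strsplit(trimws(sci_name), "\\s+"))
  ifelse(n >= 3, "infraspecific", "species")
}

name_depth(c("Cornus amomum", "Cornus amomum ssp. obliqua"))
#> [1] "species"       "infraspecific"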

Tetlin NWR topology exception

fwspp::prep_cadastral(fwspp::find_fws("tetlin"), "admin", T)
#> Error in CPL_geos_union(st_geometry(x), by_feature): Evaluation error: TopologyException: Input geom 0 is invalid: Hole lies outside shell at or near point -141.55936108399999 62.771209905000035 at -141.55936108399999 62.771209905000035.
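A likely workaround, sketched with sf: repair the invalid ring before the union that throws the exception. st_make_valid handles the "hole lies outside shell" class of invalidity; `tetlin_sf` below stands in for the geometry that prep_cadastral loads:

library(sf)

# Repair invalid polygons (e.g., holes lying outside their shell)
# before the union step where the TopologyException is thrown
tetlin_sf <- st_make_valid(tetlin_sf)
tetlin_union <- st_union(tetlin_sf)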

Consider not extracting direct media URLs

This mainly affects GBIF and BISON queries. Is it generally true that the direct media URL is accessible from the general record URL (e.g., as in iNaturalist)? If so, going fishing for direct media URLs may be unnecessary, and skipping it would be faster in the case of rgbif::occ_data...

Update method for `fwspp` object

Store the datetime a query was initiated as an attribute. Subsequently pass it to the get_* functions to, I suspect, profoundly reduce the size and increase the speed of query updates (see the sketch below).
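A minimal sketch of the idea, reusing the query_dt attribute already visible in the Sachuest output above; how the get_* functions would consume it is an assumption:

# Stamp the object when the query starts...
stamp_query <- function(fwspp_obj) {
  attr(fwspp_obj, "query_dt") <- Sys.time()
  fwspp_obj
}

# ...then an update can ask providers only for records added since then
update_window <- function(fwspp_obj) {
  since <- attr(fwspp_obj, "query_dt")
  format(since, "%Y-%m-%d")  # e.g., a lower bound passed to get_* queries
}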

`NA` getting tacked on to combined common names when updating invalid taxa

fwspp::retrieve_taxonomy("Solidago graminifolia")
#>                sci_name          acc_sci_name
#> 1 Solidago graminifolia Euthamia graminifolia
#>                                                       com_name    rank
#> 1 NA, flattop goldentop, flat-top goldentop, slender goldentop Species
#>         category taxon_code   tsn note
#> 1 Vascular Plant     140446 37352 <NA>

This appears to occur when no common name is found for the original taxon but common names are found for the accepted taxon. Probably a simple na.omit fix, sketched below...
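A minimal sketch of that fix, using the common names from the example above:

# Drop NAs before collapsing common names, so a missing name for the
# original taxon doesn't become a literal "NA" in the combined string
com_names <- c(NA, "flattop goldentop", "flat-top goldentop", "slender goldentop")
paste(na.omit(com_names), collapse = ", ")
#> [1] "flattop goldentop, flat-top goldentop, slender goldentop"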

Split properties with widely-spaced polygons

Some properties are relatively small in actual area compared to the area subsumed by their convex hulls. Two very good examples are Great Thicket and Blackwater.

Maybe partition a MULTIPOLYGON into component polygons if the component polygon area is below some threshold of the MULTIPOLYGON bounding box area? For example, the corresponding percentages for Great Thicket and Blackwater are ~1.5% and 4%, respectively. Could possibly ignore this complication if the number of records was relatively small (< 500K maybe) or the absolute bounding box area was relatively small as well...

This split should occur prior to, and not affect, possible temporally-split queries by get_GBIF.
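A minimal sketch of the proposed test with sf (the 5% threshold echoes the Great Thicket/Blackwater figures above; function and argument names are illustrative):

library(sf)

# Split a MULTIPOLYGON property into component polygons when its area
# is a small fraction of its bounding-box area
maybe_split <- function(prop_sf, frac_threshold = 0.05) {
  bbox_area <- st_area(st_as_sfc(st_bbox(prop_sf)))
  frac <- as.numeric(sum(st_area(prop_sf)) / bbox_area)
  if (frac < frac_threshold) st_cast(prop_sf, "POLYGON") else prop_sf
}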

Some refuges causing trouble with `gbif_count`

The Alaska Maritime problem seems related to crossing the international date line, and thus may be best handled together with the solution to #2.

Items below without parenthetical details spawned a 500 Server error:

  • ALASKA MARITIME NATIONAL WILDLIFE REFUGE
  • BAKER ISLAND NATIONAL WILDLIFE REFUGE
  • BRETON NATIONAL WILDLIFE REFUGE (Request Entity Too Large: WKT too large)
  • HOWLAND ISLAND NATIONAL WILDLIFE REFUGE
  • IZEMBEK NATIONAL WILDLIFE REFUGE (Request Entity Too Large)
  • JARVIS ISLAND NATIONAL WILDLIFE REFUGE
  • JOHNSTON ATOLL NATIONAL WILDLIFE REFUGE
  • KINGMAN REEF NATIONAL WILDLIFE REFUGE
  • MARIANA ARC OF FIRE NATIONAL WILDLIFE REFUGE (Request-URI Too Long)
  • MIDWAY ATOLL NATIONAL WILDLIFE REFUGE
  • NAVASSA ISLAND NATIONAL WILDLIFE REFUGE (Request Entity Too Large)
  • PALMYRA ATOLL NATIONAL WILDLIFE REFUGE
  • SUSQUEHANNA NATIONAL WILDLIFE REFUGE (Request Entity Too Large)
  • WAKE ATOLL NATIONAL WILDLIFE REFUGE

Reduce the over-the-top error handling

The error handling is far too convoluted. The way it should work: if some stage of manage_gets fails, that failure is captured and the process for that property is broken off, saved, and moved along. Thus, it seems that performing manage_gets safely will be adequate on its own. Removing purrr::safely from get_verb_N will affect several files in many locations, but it's worth it for clarity's sake (see the sketch below)...
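A minimal sketch of the simplification: capture failures once at the top level rather than inside every get_* verb (manage_gets is the package's internal name; the wrapper is illustrative):

# Wrap only manage_gets with purrr::safely, instead of wrapping
# each get_* function individually
safe_manage_gets <- purrr::safely(manage_gets)

res <- safe_manage_gets(prop)
if (!is.null(res$error)) {
  # store the captured error for this property and move along
  message("Query failed: ", conditionMessage(res$error))
}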

Update xlsx_submission function so each record is unique

The workbook created by the xlsx_submission function now contains the "SpeciesListForImport" tab. Each row in this tab is associated with a unique taxon observation, and the data are formatted to be consistent with the FWSpecies bulk submission template.

Get cadastral data from AGOL

Users are currently required to download the entire FWS cadastral dataset from ServCat. It would be more efficient to download data for specific refuges using the ArcGIS Online REST API (see the sketch below).
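A rough sketch of such a query with httr and sf; the service URL, layer index, and field name below are placeholders, not the confirmed location of the FWS cadastral service:

library(httr)

# Hypothetical ArcGIS REST query for a single refuge's boundary;
# the service URL and ORGNAME field are assumptions
get_refuge_boundary <- function(orgname) {
  resp <- GET(
    "https://gis.fws.gov/arcgis/rest/services/FWSInterest/MapServer/0/query",
    query = list(where = sprintf("ORGNAME = '%s'", orgname),
                 outFields = "*", f = "geojson")
  )
  stop_for_status(resp)
  sf::st_read(content(resp, as = "text"), quiet = TRUE)
}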

Update spatial function to sf

The package currently depends on sp and rgeos, which will be retired in October 2023. We need to update all spatial functions to sf.

Accommodate strict nativeness requirement

From Sarah Shultz:

The database has a slightly picky rule requiring that a nativeness value be assigned if a record is approved. Here are a couple of options; let me know if either is agreeable:

  • Set nativeness to "unknown" (where currently null) and then update in the future if/when someone has time to seek out this information
  • Leave nativeness blank and set the record status to "in review" (but remember, then the records won't show up on the basic checklist)

In short, when processing FWSpecies reviews for submission with fwspp_submission, we need to offer the user one of the above options, with the first as the default (see the sketch below).
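A minimal sketch of the default (first) option, assuming a nativeness column in the processed review data:

library(dplyr)

# Default behavior: fill missing nativeness with "unknown" so approved
# records pass the FWSpecies rule; values can be refined later
fill_nativeness <- function(review_df, default = "unknown") {
  mutate(review_df, nativeness = coalesce(nativeness, default))
}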

Review output

THIS ISSUE IS A WORK IN PROGRESS

Currently, the output fwspp object is a list of data frames (one per property) that may have taxonomic information. EDIT: Require them to have taxonomy; review is frustrating without it...

We need an option to take this object and export it to a spreadsheet for review with the following columns (an export sketch follows the list):

  • org_name (narrow column since it'll be superfluous, but necessary when re-importing)
  • category
  • taxon_code
  • sci_name
  • com_name
  • occurrence (as evaluated when compared against existing FWSpecies records for property)
  • nativeness (imported from existing FWSpecies [?] or blank for new records)
  • accept_record (defaults to YES for all retained records)
  • evidence (or ExternalLinks)
  • note (notes will be useful for identifying records that may be corrected at this stage)
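A minimal sketch of the export step with openxlsx, using the column set above and one sheet per property (function name illustrative):

library(openxlsx)

# Write one review sheet per property, keeping the columns listed above
export_review <- function(fwspp_obj, file) {
  cols <- c("org_name", "category", "taxon_code", "sci_name", "com_name",
            "occurrence", "nativeness", "accept_record", "evidence", "note")
  wb <- createWorkbook()
  for (prop in names(fwspp_obj)) {
    df <- fwspp_obj[[prop]]
    df$accept_record <- "YES"     # default for all retained records
    sheet <- substr(prop, 1, 31)  # Excel's sheet-name length limit
    addWorksheet(wb, sheet)
    writeData(wb, sheet, df[intersect(cols, names(df))])
  }
  saveWorkbook(wb, file, overwrite = TRUE)
}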

grbio data is useful, but not unique

There are several hundred instances of GRBio institutions sharing the same acronym, and these are causing problems during cleaning. There's no obvious way to assign unique acronyms and still link to the occurrence data based on the institution code.

Conclusion: cut the grbio linkage and deal with it...

iDigBio seems to be letting varieties and subspecies through?

See, e.g., Erie NWR (a filter sketch follows the output):

#>                        org_name
#> 1 ERIE NATIONAL WILDLIFE REFUGE
#> 2 ERIE NATIONAL WILDLIFE REFUGE
#> 3 ERIE NATIONAL WILDLIFE REFUGE
#>                                      sci_name       lon      lat loc_unc_m
#> 1       Symphyotrichum puniceum var. puniceum -80.00139 41.78611        14
#> 2 Symphyotrichum lanceolatum var. lanceolatum -79.95472 41.56987        NA
#> 3                  Cornus amomum ssp. obliqua -79.98504 41.58978        14
#>   year month day
#> 1 1994     9   7
#> 2 1969     9  27
#> 3 2005     8   4
#>                                                                 evidence
#> 1 portal.idigbio.org/portal/records/bfb1c8bf-b196-48bd-a908-43a5c85cf51a
#> 2 portal.idigbio.org/portal/records/65130d25-9fc9-4171-92f7-836c02bac556
#> 3 portal.idigbio.org/portal/records/d7b5d622-3024-4b7d-a819-3cd2291b8094
#>   bio_repo                com_name       rank       category taxon_code
#> 1  iDigBio        purplestem aster    Variety Vascular Plant     295996
#> 2  iDigBio NA, white panicle aster    Variety Vascular Plant     290904
#> 3  iDigBio           silky dogwood Subspecies Vascular Plant     130553
#>      tsn note
#> 1 566343 <NA>
#> 2 566832 <NA>
#> 3  27801 <NA>
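A minimal sketch of a post-hoc filter for such records, assuming infraspecific names carry "var.", "ssp.", or "subsp." markers as in the output above:

# Drop records whose scientific names include infraspecific markers
# that strict scrubbing should have caught
drop_infraspecific <- function(df) {
  df[!grepl("\\b(var|ssp|subsp)\\.", df$sci_name), , drop = FALSE]
}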

datecollected vs eventDate

Currently, ridigbio returns datecollected by default, which we do not recommend be used in scientific research. When a data provider does not provide a full date in the Darwin Core eventDate field, the complete value or the missing parts (i.e., month and/or day) are randomly generated and thus may lack any real meaning. The generated dates are difficult to detect, as they are randomly distributed. We are currently working to modify our ingestion pipeline to avoid randomly generating dates. However, dates remain an issue across biodiversity aggregators, and the solution is not clear (see GBIF, for example).

Why does this matter for fwspp?
I found that datecollected is used by this repository as if it were a real value. This may lead to artificial dates being used to make management decisions!

How to use other fields:
We plan to update the ridigbio package to instead return "data.dwc:eventDate", "data.dwc:year", "data.dwc:month", and "data.dwc:day", which are all text fields rather than dates. These fields are not randomly generated; the values come directly from data providers and therefore may carry real meaning in biological research. See the current issue and pull request.

Since this package currently downloads "all" fields, I hoped this fix might involve only your clean_iDigBio function and not your get_iDigBio function. Sadly, not all fields are returned when "all" fields are specified; instead, you will need to specify which fields to download. From your code, I believe you want scientificname, lat/lon, coordinate uncertainty, catalognumber, UUID, and date. To obtain these fields, this is how you would modify the download:

fields2get <- c("data.dwc:scientificName",
                "data.dwc:decimalLatitude",
                "data.dwc:decimalLongitude",
                "data.dwc:coordinateUncertaintyInMeters",
                "catalognumber",
                "uuid",
                "data.dwc:eventDate",
                "data.dwc:year",
                "data.dwc:month",
                "data.dwc:day")
idb_recs <- try_idb(type = "records", mq = FALSE, rq = rq, fields = fields2get,
                    max_items = 100000, limit = 0, offset = 0, sort = FALSE,
                    httr::config(timeout = timeout))

Additional modification to clean_iDigBio will also be needed, since the dates downloaded here will not be in date format; instead, all dates will be text strings. There are many ways to convert these to dates; for example, see the gatoRs remove_duplicate function or the ridigbio proposed solution here (one approach is sketched below).
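A minimal sketch of one such conversion with lubridate, assuming mostly ISO 8601 text in data.dwc:eventDate; partial dates are left NA rather than invented:

library(lubridate)

# Parse eventDate text; truncated = 3 tolerates missing time components,
# and unparseable (e.g., year-only) strings become NA instead of fake dates
parse_event_date <- function(x) {
  suppressWarnings(as_date(ymd_hms(x, truncated = 3)))
}

parse_event_date(c("1994-09-07", "1994-09-07T10:30:00Z", "1994"))
#> [1] "1994-09-07" "1994-09-07" NA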

Hope this helps and please let me know if you have any questions or want more specific code suggestions.
