heritagenetwork / regional_sdm
Collaborative distribution modeling scripts by the Heritage Network and NatureServe.
License: Other
Maybe this should be included in #14, but what is our input file naming convention? Are we planning on using the CuteCode or EGT?
@ChristopherTracey are you seeing this at the bottom of script 4c?
> dbSendStatement(db, SQLquery)
<SQLiteResult>
SQL INSERT INTO tblRubric (model_run_name, spdata_dataqual, spdata_abs, spdata_eval, envvar_relevance, envvar_align, process_algo, process_sens, process_rigor, process_perform, process_review, products_mapped, products_support, products_repo,interative,spdata_dataqualNotes,spdata_absNotes,spdata_evalNotes,envvar_relevanceNotes,envvar_alignNotes,process_algoNotes,process_sensNotes,process_rigorNotes,process_performNotes,process_reviewNotes,products_mappedNotes,products_supportNotes,products_repoNotes,interativeNotes) VALUES ('anaxexsu_20190109_084552','I','A','A','A','A','I','A','A','A','I','A','I','A','A','','','','','','','','','','','','','','');
ROWS Fetched: 0 [complete]
Changed: 1
> ## clean up ----
> dbDisconnect(db)
Warning message:
In connection_release(conn@ptr) :
There are 1 result in use. The connection will be released when they are closed
>
Note that I've tried dbSendQuery, dbExecute, and dbSendStatement, and seem to be getting the same results.
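For what it's worth, that warning at dbDisconnect() usually means the result object returned by dbSendQuery()/dbSendStatement() was never cleared. A minimal sketch of the pattern, using the same objects as above:
res <- dbSendStatement(db, SQLquery)
dbClearResult(res)   # release the open result; dbDisconnect() then closes cleanly
dbDisconnect(db)
# or in one step: dbExecute(db, SQLquery) sends the statement and clears the result for you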
Have there been some thoughts on a presence file schema? It would be great to have the same file setup for both aquatic and terrestrial presences.
Right now for aquatics the scripts expect a csv file with the columns:
| COMID | huc12 | group_id | EO_ID_ST | SCOMNAME | SNAME | OBSDATE |
|---|---|---|---|---|---|---|
And for terrestrial we expect a shapefile with the columns:
| EO_ID_ST | SNAME | SCOMNAME | SFRACalc | OBSDATE |
|---|---|---|---|---|
It would be nice to have shapefiles (e.g., points, lines, or polygons) as expected input, and the same table schema for both. I'd propose:
| UID | SPECIES_CODE | EO_ID_ST | GROUP_ID | SFRACalc | OBSDATE |
|---|---|---|---|---|---|
| unique feature ID | character code for species | Biotics EO_ID (can be NA) | numeric group id | RA value; numeric? | yyyy-mm-dd or NA |
Any other columns or ideas for this?
As a follow-up to the metadata MoBI tech team call, I thought you might like to check out the mapview package: https://r-spatial.github.io/mapview/index.html. There is a function mapshot that allows you to take a static snapshot of a Leaflet map. Among other things, this would allow you to use any tiled basemap as the background. For instance, I've pasted two different mapshot outputs with different basemaps, saved as PNG files. mapview handles raster, sp, and sf classes, so there's some good flexibility in what final maps could contain. Just some food for thought, and it might save some heavy lifting on map output.
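A minimal sketch of that workflow (my_points and the basemap choice are placeholders, not from this thread):
library(mapview)
m <- mapview(my_points, map.types = "Esri.WorldImagery")  # any sf/sp/raster object; pick a tiled basemap
mapshot(m, file = "model_map.png")                        # static PNG snapshot of the Leaflet map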
By the way, you all have done an excellent job with the metadata template and typesetting. Working with LaTeX is no picnic!
In script 4, if I set an extent and then try to use it in the call like this:
outRas <- clusterR(envStack, predict, args = list(model=rf.full, type = "prob", index = 2,
ext = hucExtent, filename = fileNm), verbose = TRUE)
it fails, while the single-core call works fine:
outRas <- predict(object=envStack, model=rf.full, type = "prob", index=2, ext = hucExtent,
filename=fileNm, format = "GTiff", overwrite=TRUE)
Does anyone know any other way to get this to work? Is there any chance we can leverage Microsoft R Open here?
Tim
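One untested workaround sketch for the clusterR call above: crop the stack up front so ext never has to pass through the cluster call (object names as in the original):
envStackCrop <- crop(envStack, hucExtent)  # apply the extent before parallelizing
outRas <- clusterR(envStackCrop, predict, args = list(model = rf.full, type = "prob", index = 2),
                   filename = fileNm, verbose = TRUE)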
Since I originally volunteered to handle this, I'm assigning myself to it. It seems like we need to settle on at least a working database schema, or we'll be running into a lot of merge conflicts...
Right now in the sqlite folder, there is Tim's latest database schema (sqliteDBDump.txt) and the one I had recently updated (sqlite_template_db_nodata.sql), which has a tracking scheme for input presence files and is used in all the recent updates I've made.
I can merge these with naming in favor of sqliteDBDump.txt, while also including the features and tables I've added (tblVarsUsed and lkpEnvVarsAqua). Does that seem like a good plan? Any features still missing from either database we'd like to add?
@tghoward, on line 322 of the .rnw in the terrestrial branch there is an \end{minipage} statement that removes the comments and rubric from the PDF when the same code is applied in the aquatic branch. Is it working OK for you?
If the selected project area spans multiple HUC2s, this error appears.
Using huc_level of 0...
Error in paste(HUCsubset, collapse = "','") :
object 'HUCsubset' not found
Line 51 is the culprit; commenting it out seems to do the trick:
envvar_list <- names(df.abs)[names(df.abs) %in% envvar_list] # gets a list of environmental variables
Thoughts?
Related to #62, but not directly. This script saves an attributed SQLite table:
https://github.com/HeritageNetwork/Regional_SDM/blob/master/preprocessing/preproc_attributeBackgroundPts.R
but this script requires spatial information and currently uses st_read (Regional_SDM/1_pointsInPolys_cleanBkgPts.R, line 185 in 212a18e).
The workflow is still to have a master set of background points, right?
Fix suggestions?
I like this wrapper, but I also want to make sure we can easily run the scripts as pieces too. Any suggestions for how to resurrect that ability (or for the scripts to serve both alternatives at once)?
@ChristopherTracey in order to address the requested metadata citation refinement (ID modeler), we should populate the lkpModelers table and the ModelerID field in the lkpSpecies table. Use ModelerID = 3 for PA, and I'll use ModelerID = 4 for NY, as those were the values from the Eastern Regional Project.
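A minimal sketch of that via DBI (the column names beyond ModelerID, and the species filter, are assumptions; check the actual schema):
# column names and the sp_code value are illustrative, not confirmed against the schema
dbExecute(db, "INSERT INTO lkpModelers (ModelerID, ProgramName) VALUES (3, 'PA'), (4, 'NY')")
dbExecute(db, "UPDATE lkpSpecies SET ModelerID = 3 WHERE sp_code = 'anaxexsu'")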
Anne suggests that we change the order as follows:
Quick confirmation needed on the difference between model_comments and metaData_comments: model comments are only stored in the SQLite database, and metadata comments are displayed in the metadata for public view, right?
Is there any specific way we need to display them?
I don't think this is how it's supposed to work:
The thick red lines represent the training inputs to the model, whereas the blue lines are the results (classified by probability). No result flowlines overlap the input points, so it seems like we are deleting these somewhere (script 4?).
As referenced in #14, we're going to drop the HUC12 field from the input dataset and assign it from the EnvVars table. It's currently used in 1_pointsInPolys_cleanBkgPts.R, but we don't hit the EnvVars until 2_attributePoints.R, so we'll have to rearrange some things.
Do we still need 0_pathsAndSettings.R now that we're using here() and the wrapper?
I couldn't see a good reason to keep all those files if the pdf is successfully created, especially since we're dealing with thousands of files in the MoBI project. So I wrote some code to delete them if the pdf is successfully created. If it's not, they are kept.
It's in #43, but please push to master if you think it's useful.
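The idea is roughly this (a sketch only; the actual code is in #43):
if (file.exists(paste0(model_run_name, ".pdf"))) {
  # intermediates are removed only when the PDF actually built
  aux <- list.files(pattern = paste0(model_run_name, "\\.(aux|log|out|tex)$"))
  file.remove(aux)
}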
When running on MoBIprep, the metadata step was erroring out without creating the pdf, with the following error:
In system(sprintf("%s %s.sty", kpsewhich(), name), intern = TRUE) : running command 'kpsewhich framed.sty' had status 1
Tracking it down, it looks like MiKTeX wanted to install packages, but the pop-up window was blocked.
To fix, in the MiKTeX Console, change the package installation setting to "Always install missing packages on the fly".
Just for your info, one of you can close whenever.
@dnbucklin all of the "nm_" objects in user_run_SDM.R contain the extension for the filename, with the exception of nm_presfile (line 20 in 5512e53). Any reason why this one shouldn't also have '.shp' appended? Seems inconsistent to me. As in, like this:
nm_presFile <- here("_data", "occurrence", paste0(model_species, ".shp"))
Minimum reach group, etc
This is interesting. Crop speed is related to the number of cores, but not as I'd expect. I wonder if it has to do with the number of env vars getting cropped. In this case I am clipping only 4 rasters. The following timings only show (approx) lines 55-56 in script 4.
This timing uses the 11 cores originally in the code:
> start_time <- Sys.time()
> source(paste0(loc_scripts, "/helper/crop_mask_rast.R"), local = TRUE)
Reading layer `HUC10' from data source `E:\mobi_repo_tgh_clean\Regional_SDM\_data\other_sp\HUC10.shp' using driver `ESRI Shapefile'
Simple feature collection with 4 features and 17 fields
geometry type: POLYGON
dimension: XY
bbox: xmin: 1831561 ymin: 2366316 xmax: 1881251 ymax: 2433963
epsg (SRID): NA
proj4string: +proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=23 +lon_0=-96 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs
Writing layer `clipshp' to data source `E:/mobi_repo_tgh_clean/Regional_SDM/_data/species/bombferv/inputs/temp_rasts' using driver `ESRI Shapefile'
features: 1
fields: 1
geometry type: Polygon
Creating raster subsets for species for 4 environmental variables...
> envStack <- stack(newL)
> (diff <- Sys.time() - start_time)
Time difference of 7.505285 secs
This timing ups the cores to 30 (35 available on MoBIprep):
> start_time <- Sys.time()
> source(paste0(loc_scripts, "/helper/crop_mask_rast.R"), local = TRUE)
Reading layer `HUC10' from data source `E:\mobi_repo_tgh_clean\Regional_SDM\_data\other_sp\HUC10.shp' using driver `ESRI Shapefile'
Simple feature collection with 4 features and 17 fields
geometry type: POLYGON
dimension: XY
bbox: xmin: 1831561 ymin: 2366316 xmax: 1881251 ymax: 2433963
epsg (SRID): NA
proj4string: +proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=23 +lon_0=-96 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs
Writing layer `clipshp' to data source `E:/mobi_repo_tgh_clean/Regional_SDM/_data/species/bombferv/inputs/temp_rasts' using driver `ESRI Shapefile'
features: 1
fields: 1
geometry type: Polygon
Creating raster subsets for species for 4 environmental variables...
> envStack <- stack(newL)
> (diff <- Sys.time() - start_time)
Time difference of 11.99645 secs
And this timing reduces the cores to 4:
> start_time <- Sys.time()
> source(paste0(loc_scripts, "/helper/crop_mask_rast.R"), local = TRUE)
Reading layer `HUC10' from data source `E:\mobi_repo_tgh_clean\Regional_SDM\_data\other_sp\HUC10.shp' using driver `ESRI Shapefile'
Simple feature collection with 4 features and 17 fields
geometry type: POLYGON
dimension: XY
bbox: xmin: 1831561 ymin: 2366316 xmax: 1881251 ymax: 2433963
epsg (SRID): NA
proj4string: +proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=23 +lon_0=-96 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs
Writing layer `clipshp' to data source `E:/mobi_repo_tgh_clean/Regional_SDM/_data/species/bombferv/inputs/temp_rasts' using driver `ESRI Shapefile'
features: 1
fields: 1
geometry type: Polygon
Creating raster subsets for species for 4 environmental variables...
> envStack <- stack(newL)
> (diff <- Sys.time() - start_time)
Time difference of 5.81722 secs
We need more testing, and perhaps in real runs we'll always have more rasters than cores, so the point may be moot, but it does look like using more cores than layers slows things down.
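If that pattern holds, one hedged tweak is to cap the cluster size at the layer count (a sketch; object names follow the scripts):
library(raster)
nCores <- min(parallel::detectCores() - 1, nlayers(envStack))  # never more workers than layers
beginCluster(nCores)
# ... crop/mask each env var in parallel, as in crop_mask_rast.R ...
endCluster()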
Every line is somewhat jammed together.
What happens in the scripts if there are NAs in the EnvVar data? Will it fail, or just not predict for that flowline?
Related to this, I believe there are areas of the US where we may have more NAs in certain variables (e.g., the desert Southwest). Do we need to add something to the code where, after the project area is limited to certain HUCs (based on the species range), it drops any variables that have NA values?
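A rough sketch of that drop (the threshold and object names are assumptions):
naFrac <- sapply(1:nlayers(envStack), function(i) mean(is.na(values(envStack[[i]]))))
envStack <- envStack[[which(naFrac < 0.5)]]  # keep only layers with usable data in the clipped HUCs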
@ChristopherTracey @dnbucklin, what's your setup to batch process 0_user_run_SDM.r? We definitely don't want to be manually changing the entries in that script and waiting for it to run before running a new spp.
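One minimal batching sketch, assuming 0_user_run_SDM.r can be edited to read model_species from the calling environment instead of hardcoding it:
# species codes here are just the two seen in this thread, for illustration
for (model_species in c("anaxexsu", "bombferv")) {
  source(file.path(loc_scripts, "0_user_run_SDM.r"), local = TRUE)
}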
As we get closer to release, we should get a recommended citation for the SDM code, as described here: https://help.github.com/articles/referencing-and-citing-content/
Also, some sort of open source statement.
Started work on this yesterday: a range_huc12 table in background.sqlite that contains the HUC12 id and the WKT for the polygon, keyed by model_run_name. Still need to integrate this into the map as the study area.
Hey @tghoward, this is a question I've had for a long time. What does the phrase 'ability to find new sites' mean under the thermometer on the metadata?
The terrestrial scripts are going to need to handle polygon-only inputs (EO data), point-only inputs (observations), and cases where we have both.
per Regan's request
I'm confused about background point generation and attribution on the terrestrial side. Do we need HUC12 attribution? Do we need it in an SQLite db? State-based? Some other method?
repo_head isn't being found when I try to write to the modelrun_meta_data table.
The NatureServe standard is "Scientific name, common name, g-rank (with definition, e.g., 'G1 - Critically Imperiled')"
It looks like I am running into this problem a few times throughout the script. The first is at the bottom of script 2, at st_write:
> st_write(points_attributed, paste0("model_input/", filename), delete_layer = T)
Writing layer `bombferv_20181220_140139_att' to data source `model_input/bombferv_20181220_140139_att.shp' using driver `ESRI Shapefile'
features: 1271
fields: 51
geometry type: Point
There were 12 warnings (use warnings() to see them)
> warnings()[1:2]
Warning messages:
1: In CPL_write_ogr(obj, dsn, layer, driver, as.character(dataset_options), ... :
GDAL Message 1: Value -13706267 of field crvprox100 of feature 533 not successfully written. Possibly due to too larger number with respect to field width
2: In CPL_write_ogr(obj, dsn, layer, driver, as.character(dataset_options), ... :
GDAL Message 1: Value -12236988 of field crvslpx100 of feature 540 not successfully written. Possibly due to too larger number with respect to field width
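These look like shapefile DBF field-width limits. One possible workaround (a sketch, not tested here) is writing to a format without fixed field widths, e.g. GeoPackage:
# same write, GeoPackage output instead of shapefile
st_write(points_attributed, paste0("model_input/", sub("\\.shp$", ".gpkg", filename)), delete_layer = TRUE)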
@tghoward, we seem to be using EGT-ID a lot more than we used to. Should it be included somewhere on the metadata? Perhaps next to Species Code.
The raster and feature subfolders are missing in the GitHub version of the other_spatial folder structure.
Instead of [1], to help reach the goal of kinder, gentler metadata.
Possibly do away with the abbreviations (e.g., "dist"), as they are now included in the envvar definition table.
@ChristopherTracey - I notice you only show results for reaches in Table 3 (Regional_SDM/MetadataEval_knitr.rnw, line 317 in 83bffe3). I've dropped the EO and poly columns but have kept a column for groups. If you are still validating by groups, wouldn't you still want to have a groups column in this table?
@ChristopherTracey @dnbucklin Since the PR where we started talking about this is closed (#4), opening up the discussion again here.
If I'm following correctly, the excellent wrapper system in aquatic (and master) creates folders within a folder named by species code. It also puts, and references, a full set of the repository (the scripts) in there (e.g., species/sppcode/inputs/scripts/Regional_SDM_date).
Since the wrapper will build this file structure for you, I find myself running the wrapper from one copy of the repository, but then somewhere along the line (right away?) it drops into and uses the scripts from the other repository.
I think I understand the reasoning for wanting to save a set of the scripts that were used in that particular run, but it also seems a little funny to not be using the scripts that you have loaded in RStudio!
Perhaps it might make more sense to save the version (commit #?) of the repository but not the entire repository? Can one of you discuss the reasoning here?
Thanks,
Tim
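For the commit-only option raised above, a minimal sketch (assumes git is on the PATH and the working directory is the repo root):
repo_head <- system("git rev-parse HEAD", intern = TRUE)  # current commit hash
writeLines(repo_head, "scripts_commit.txt")  # store alongside the run outputs instead of a full repo copy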
Should we change how we generate and store background points/reaches tables, given we're using ranges to define model domains? A couple of questions related to this:
Is it necessary (or just a good idea) to use the same background points for every species? Alternatively, we could generate them on the fly in step 1. This would allow us to vary the density/number of background points by species, if desired.
We could also attribute background points (extract values from EVs) within the model process. I think this is a more flexible way to go and shouldn't add much extra time, given we're already loading the data to attribute presence points.
Should we alter the exclusion distance for background points? It's only 30 m now; it could be at least increased to the raster resolution.
Based on @emiliehenderson's question about what threshold is used in the validation loops: it is currently the mean of an estimate of the upper-left corner of the ROC curve, so a single value is applied to all LOO runs. I'd like to change that to calculating a value for each loop. Does anyone have a preference on the threshold metric? How about maxSSS (https://onlinelibrary.wiley.com/doi/full/10.1002/ece3.1878)? @ChristopherTracey, ok?
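For reference, a minimal per-loop maxSSS sketch using the ROCR package (preds and labels are hypothetical stand-ins for one loop's predicted probabilities and 0/1 truth):
library(ROCR)
pred <- prediction(preds, labels)
perf <- performance(pred, "sens", "spec")        # sensitivity (y) vs. specificity (x) across cutoffs
ss <- perf@y.values[[1]] + perf@x.values[[1]]    # sensitivity + specificity at each cutoff
thresh <- perf@alpha.values[[1]][which.max(ss)]  # the maxSSS threshold for this loop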
@ChristopherTracey any thoughts how we'll use (or if we'll need to use) the Data Sources table (lkpDataSources) and the map to species table (mapDataSourcesToSpp) in the tracking DB in this project? If you count the mjd as one source, we have only a few sources, but it will vary by species. How might we get these tables filled? Does the mjd count as one source?
On the terrestrial side, already complete for aquatic.
If you run script 3 in debugging mode, the info at the bottom (such as start_time) doesn't exist?
As we're working to move the lotic EnvVars to an SQLite version instead of a .csv file in order to increase model speed/performance, I have a quick question about its location. Should it be a table in the main SQLite db, or should it be in its own separate SQLite file?
The current aquatic EnvVars table for CONUS is about 16 GB.