heritagenetwork / regional_sdm
Collaborative distribution modeling scripts by the Heritage Network and NatureServe.
License: Other
Maybe this should be included in #14, but what is our input file naming convention? Are we planning on using the CuteCode or EGT?
@ChristopherTracey are you seeing this at the bottom of script 4c?
> dbSendStatement(db, SQLquery)
<SQLiteResult>
SQL INSERT INTO tblRubric (model_run_name, spdata_dataqual, spdata_abs, spdata_eval, envvar_relevance, envvar_align, process_algo, process_sens, process_rigor, process_perform, process_review, products_mapped, products_support, products_repo,interative,spdata_dataqualNotes,spdata_absNotes,spdata_evalNotes,envvar_relevanceNotes,envvar_alignNotes,process_algoNotes,process_sensNotes,process_rigorNotes,process_performNotes,process_reviewNotes,products_mappedNotes,products_supportNotes,products_repoNotes,interativeNotes) VALUES ('anaxexsu_20190109_084552','I','A','A','A','A','I','A','A','A','I','A','I','A','A','','','','','','','','','','','','','','');
ROWS Fetched: 0 [complete]
Changed: 1
> ## clean up ----
> dbDisconnect(db)
Warning message:
In connection_release(conn@ptr) :
There are 1 result in use. The connection will be released when they are closed
>
Note that I've tried dbSendQuery, dbExecute, and dbSendStatement, and seem to be getting the same results.
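For what it's worth, that warning at dbDisconnect() usually means the result object returned by dbSendQuery()/dbSendStatement() was never cleared. A minimal sketch of the pattern, using the same objects as above:
res <- dbSendStatement(db, SQLquery)
dbClearResult(res)   # release the open result; dbDisconnect() then closes cleanly
dbDisconnect(db)
# or in one step: dbExecute(db, SQLquery) sends the statement and clears the result for you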
Have there been some thoughts on a presence file schema? It would be great to have the same file setup for both aquatic and terrestrial presences.
Right now for aquatics the scripts expect a csv file with the columns:
| COMID | huc12 | group_id | EO_ID_ST | SCOMNAME | SNAME | OBSDATE |
|---|---|---|---|---|---|---|
And for terrestrial we expect a shapefile with the columns:
| EO_ID_ST | SNAME | SCOMNAME | SFRACalc | OBSDATE |
|---|---|---|---|---|
It would be nice to have shapefiles (e.g., points, lines, or polygons) as expected input, and the same table schema for both. I'd propose:
| UID | SPECIES_CODE | EO_ID_ST | GROUP_ID | SFRACalc | OBSDATE |
|---|---|---|---|---|---|
| unique feature ID | character code for species | Biotics EO_ID (can be NA) | numeric group id | RA value; numeric? | yyyy-mm-dd or NA |
Any other columns or ideas for this?
As a follow-up to the metadata MoBI tech team call, I thought you might like to check out the mapview package: https://r-spatial.github.io/mapview/index.html. There is a function mapshot that allows you to take a static snapshot of a Leaflet map. Among other things, this would allow you to use any tiled basemap as the background. For instance, I've pasted two different mapshot outputs with different basemaps, saved as PNG files. mapview handles raster, sp, and sf classes, so there's some good flexibility in what final maps could contain. Just some food for thought, and it might save some heavy lifting on map output.
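A minimal sketch of that workflow (my_points and the basemap choice are placeholders, not from this thread):
library(mapview)
m <- mapview(my_points, map.types = "Esri.WorldImagery")  # any sf/sp/raster object; pick a tiled basemap
mapshot(m, file = "model_map.png")                        # static PNG snapshot of the Leaflet map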
By the way, you all have done an excellent job with the metadata template and typesetting. Working with LaTeX is no picnic!
In script 4, if I set an extent and then try to use it in the call like this:
outRas <- clusterR(envStack, predict, args = list(model=rf.full, type = "prob", index = 2,
ext = hucExtent, filename = fileNm), verbose = TRUE)
it fails, while the single-core call works fine:
outRas <- predict(object=envStack, model=rf.full, type = "prob", index=2, ext = hucExtent,
filename=fileNm, format = "GTiff", overwrite=TRUE)
Does anyone know any other way to get this to work? Is there any chance we can leverage Microsoft R Open here?
Tim
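One untested workaround sketch for the clusterR call above: crop the stack up front so ext never has to pass through the cluster call (object names as in the original):
envStackCrop <- crop(envStack, hucExtent)  # apply the extent before parallelizing
outRas <- clusterR(envStackCrop, predict, args = list(model = rf.full, type = "prob", index = 2),
                   filename = fileNm, verbose = TRUE)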
Since I originally volunteered to handle this, I'm assigning myself to it. It seems like we need to settle on at least a working database schema, or we'll be running into a lot of merge conflicts...
Right now in the sqlite folder, there is Tim's latest database schema (sqliteDBDump.txt) and the one I had recently updated (sqlite_template_db_nodata.sql), which has a tracking scheme for input presence files and is used in all the recent updates I've made.
I can merge these with naming in favor of sqliteDBDump.txt, while also including the features and tables I've added (tblVarsUsed and lkpEnvVarsAqua). Does that seem like a good plan? Any features still missing from either database we'd like to add?
@tghoward, on line 322 of the .rnw in the terrestrial branch there is an \end{minipage} statement that removes the comments and rubric from the PDF when the same code is applied in the aquatic branch. Is it working OK for you?
If the selected project area spans multiple HUC2s, this error appears.
Using huc_level of 0...
Error in paste(HUCsubset, collapse = "','") :
object 'HUCsubset' not found
Line 51 is the culprit; commenting it out seems to do the trick:
envvar_list <- names(df.abs)[names(df.abs) %in% envvar_list] # gets a list of environmental variables
Thoughts?
Related to #62, but not directly. This script saves an attributed SQLite table:
https://github.com/HeritageNetwork/Regional_SDM/blob/master/preprocessing/preproc_attributeBackgroundPts.R
but this script requires spatial information and currently uses st_read (Regional_SDM/1_pointsInPolys_cleanBkgPts.R, line 185 in 212a18e).
The workflow is still to have a master set of background points, right?
Fix suggestions?
I like this wrapper, but I also want to make sure we can easily run the scripts as pieces too. Any suggestions for how to resurrect that ability (or for the scripts to serve both alternatives at once)?
@ChristopherTracey in order to address the requested metadata citation refinement (ID modeler), we should populate the lkpModelers table and the ModelerID field in the lkpSpecies table. Use ModelerID = 3 for PA, and I'll use ModelerID = 4 for NY, as those were the values from the Eastern Regional Project.
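A minimal sketch of that via DBI (the column names beyond ModelerID, and the species filter, are assumptions; check the actual schema):
# column names and the sp_code value are illustrative, not confirmed against the schema
dbExecute(db, "INSERT INTO lkpModelers (ModelerID, ProgramName) VALUES (3, 'PA'), (4, 'NY')")
dbExecute(db, "UPDATE lkpSpecies SET ModelerID = 3 WHERE sp_code = 'anaxexsu'")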
Anne suggests that we change the order as follows:
Quick confirmation needed on the difference between model_comments and metaData_comments: model comments are only stored in the SQLite database, and metadata comments are displayed in the metadata for public view, right?
Is there any specific way we need to display them?
I don't think this is how it's supposed to work:
The thick red lines represent the training inputs to the model, whereas the blue lines are the results (classified by probability). No result flowlines overlap the input points, so it seems like we are deleting these somewhere (script 4?).
As referenced in #14, we're going to drop the HUC12 field from the input dataset and assign it from the EnvVars table. It's currently used in 1_pointsInPolys_cleanBkgPts.R, but we don't hit the EnvVars until 2_attributePoints.R, so we'll have to rearrange some things.
Do we still need 0_pathsAndSettings.R now that we're using here() and the wrapper?
I couldn't see a good reason to keep all those files if the pdf is successfully created, especially since we're dealing with thousands of files in the MoBI project. So I wrote some code to delete them if the pdf is successfully created. If it's not, they are kept.
It's in #43, but please push to master if you think it's useful.
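The idea is roughly this (a sketch only; the actual code is in #43):
if (file.exists(paste0(model_run_name, ".pdf"))) {
  # intermediates are removed only when the PDF actually built
  aux <- list.files(pattern = paste0(model_run_name, "\\.(aux|log|out|tex)$"))
  file.remove(aux)
}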
When running on MoBIprep, the metadata step was erroring out without creating the pdf, with the following error:
In system(sprintf("%s %s.sty", kpsewhich(), name), intern = TRUE) : running command 'kpsewhich framed.sty' had status 1
Tracking it down, it looks like MiKTeX wanted to install packages, but the pop-up window was blocked.
To fix, in the MiKTeX Console, change the package installation setting to "Always install missing packages on the fly".
Just for your info, one of you can close whenever.
@dnbucklin all of the "nm_" objects in user_run_SDM.R contain the extension for the filename, with the exception of nm_presfile (line 20 in 5512e53). Any reason why this one shouldn't also have '.shp' appended? Seems inconsistent to me. As in, like this:
nm_presFile <- here("_data", "occurrence", paste0(model_species, ".shp"))
Minimum reach group, etc
This is interesting. Crop speed is related to the number of cores, but not as I'd expect. I wonder if it has to do with the number of env vars getting cropped. In this case I am clipping only 4 rasters. The following timings only show (approx) lines 55-56 in script 4.
This timing uses the 11 cores originally in the code:
> start_time <- Sys.time()
> source(paste0(loc_scripts, "/helper/crop_mask_rast.R"), local = TRUE)
Reading layer `HUC10' from data source `E:\mobi_repo_tgh_clean\Regional_SDM\_data\other_sp\HUC10.shp' using driver `ESRI Shapefile'
Simple feature collection with 4 features and 17 fields
geometry type: POLYGON
dimension: XY
bbox: xmin: 1831561 ymin: 2366316 xmax: 1881251 ymax: 2433963
epsg (SRID): NA
proj4string: +proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=23 +lon_0=-96 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs
Writing layer `clipshp' to data source `E:/mobi_repo_tgh_clean/Regional_SDM/_data/species/bombferv/inputs/temp_rasts' using driver `ESRI Shapefile'
features: 1
fields: 1
geometry type: Polygon
Creating raster subsets for species for 4 environmental variables...
> envStack <- stack(newL)
> (diff <- Sys.time() - start_time)
Time difference of 7.505285 secs
This timing ups the cores to 30 (35 available on MoBIprep):
> start_time <- Sys.time()
> source(paste0(loc_scripts, "/helper/crop_mask_rast.R"), local = TRUE)
Reading layer `HUC10' from data source `E:\mobi_repo_tgh_clean\Regional_SDM\_data\other_sp\HUC10.shp' using driver `ESRI Shapefile'
Simple feature collection with 4 features and 17 fields
geometry type: POLYGON
dimension: XY
bbox: xmin: 1831561 ymin: 2366316 xmax: 1881251 ymax: 2433963
epsg (SRID): NA
proj4string: +proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=23 +lon_0=-96 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs
Writing layer `clipshp' to data source `E:/mobi_repo_tgh_clean/Regional_SDM/_data/species/bombferv/inputs/temp_rasts' using driver `ESRI Shapefile'
features: 1
fields: 1
geometry type: Polygon
Creating raster subsets for species for 4 environmental variables...
> envStack <- stack(newL)
> (diff <- Sys.time() - start_time)
Time difference of 11.99645 secs
And this timing reduces the cores to 4:
> start_time <- Sys.time()
> source(paste0(loc_scripts, "/helper/crop_mask_rast.R"), local = TRUE)
Reading layer `HUC10' from data source `E:\mobi_repo_tgh_clean\Regional_SDM\_data\other_sp\HUC10.shp' using driver `ESRI Shapefile'
Simple feature collection with 4 features and 17 fields
geometry type: POLYGON
dimension: XY
bbox: xmin: 1831561 ymin: 2366316 xmax: 1881251 ymax: 2433963
epsg (SRID): NA
proj4string: +proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=23 +lon_0=-96 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs
Writing layer `clipshp' to data source `E:/mobi_repo_tgh_clean/Regional_SDM/_data/species/bombferv/inputs/temp_rasts' using driver `ESRI Shapefile'
features: 1
fields: 1
geometry type: Polygon
Creating raster subsets for species for 4 environmental variables...
> envStack <- stack(newL)
> (diff <- Sys.time() - start_time)
Time difference of 5.81722 secs
We need more testing, and perhaps in real runs we'll always have more rasters than cores, so the point may be moot, but it does look like using more cores than layers slows things down.
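If that pattern holds, one hedged tweak is to cap the cluster size at the layer count (a sketch; object names follow the scripts):
library(raster)
nCores <- min(parallel::detectCores() - 1, nlayers(envStack))  # never more workers than layers
beginCluster(nCores)
# ... crop/mask each env var in parallel, as in crop_mask_rast.R ...
endCluster()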
Every line is somewhat jammed together.
What happens in the scripts if there are NAs in the EnvVar data? Will it fail, or just not predict for that flowline?
Related to this, I believe there are areas of the US where we may have more NAs in certain variables (e.g., the desert Southwest). Do we need to add something to the code where, after the project area is limited to certain HUCs (based on the species range), it drops any variables that have NA values?
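A rough sketch of that drop (the threshold and object names are assumptions):
naFrac <- sapply(1:nlayers(envStack), function(i) mean(is.na(values(envStack[[i]]))))
envStack <- envStack[[which(naFrac < 0.5)]]  # keep only layers with usable data in the clipped HUCs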
@ChristopherTracey @dnbucklin, what's your setup to batch process 0_user_run_SDM.r? We definitely don't want to be manually changing the entries in that script and waiting for it to run before running a new spp.
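One minimal batching sketch, assuming 0_user_run_SDM.r can be edited to read model_species from the calling environment instead of hardcoding it:
# species codes here are just the two seen in this thread, for illustration
for (model_species in c("anaxexsu", "bombferv")) {
  source(file.path(loc_scripts, "0_user_run_SDM.r"), local = TRUE)
}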
As we get closer to release, we should get a recommended citation for the SDM code, as described here: https://help.github.com/articles/referencing-and-citing-content/
Also, some sort of open source statement.
Started work on this yesterday: a range_huc12 table in background.sqlite that contains the HUC12 id and the WKT for the polygon, keyed by model_run_name. Still need to integrate this into the map as the study area.
Hey @tghoward, this is a question I've had for a long time. What does the phrase 'ability to find new sites' mean under the thermometer on the metadata?
The terrestrial scripts are going to need to handle polygon-only inputs (EO data), point-only inputs (observations), and cases where we have both.
per Regan's request
I'm confused about background point generation and attribution on the terrestrial side. Do we need HUC12 attribution? Do we need it in an SQLite db? State-based? Some other method?
repo_head isn't being found when I try to write to the modelrun_meta_data table.
The NatureServe standard is "Scientific name, common name, g-rank (with definition, e.g., 'G1 - Critically Imperiled')"
It looks like I am running into this problem a few times throughout the script. The first is at the bottom of script 2, at st_write:
> st_write(points_attributed, paste0("model_input/", filename), delete_layer = T)
Writing layer `bombferv_20181220_140139_att' to data source `model_input/bombferv_20181220_140139_att.shp' using driver `ESRI Shapefile'
features: 1271
fields: 51
geometry type: Point
There were 12 warnings (use warnings() to see them)
> warnings()[1:2]
Warning messages:
1: In CPL_write_ogr(obj, dsn, layer, driver, as.character(dataset_options), ... :
GDAL Message 1: Value -13706267 of field crvprox100 of feature 533 not successfully written. Possibly due to too larger number with respect to field width
2: In CPL_write_ogr(obj, dsn, layer, driver, as.character(dataset_options), ... :
GDAL Message 1: Value -12236988 of field crvslpx100 of feature 540 not successfully written. Possibly due to too larger number with respect to field width
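These look like shapefile DBF field-width limits. One possible workaround (a sketch, not tested here) is writing to a format without fixed field widths, e.g. GeoPackage:
# same write, GeoPackage output instead of shapefile
st_write(points_attributed, paste0("model_input/", sub("\\.shp$", ".gpkg", filename)), delete_layer = TRUE)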
@tghoward, we seem to be using EGT-ID a lot more than we used to. Should it be included somewhere on the metadata? Perhaps next to Species Code.
The raster and feature subfolders are missing in the GitHub version of the other_spatial folder structure.
Instead of [1], to help reach the goal of kinder, gentler metadata.
Possibly do away with the abbreviations (e.g., "dist"), as they are now included in the envvar definition table.
@ChristopherTracey - I notice you only show results for reaches in Table 3 (Regional_SDM/MetadataEval_knitr.rnw, line 317 in 83bffe3). I've dropped the EO and poly columns but have kept a column for groups. If you are still validating by groups, wouldn't you still want to have a groups column in this table?
@ChristopherTracey @dnbucklin Since the PR where we started talking about this is closed (#4), opening up the discussion again here.
If I'm following correctly, the excellent wrapper system in aquatic (and master) creates folders within a folder named by species code. It also puts, and references, a full set of the repository (the scripts) in there (e.g., species/sppcode/inputs/scripts/Regional_SDM_date).
Since the wrapper will build this file structure for you, I find myself running the wrapper from one copy of the repository, but then somewhere along the line (right away?) it drops into and uses the scripts from the other repository.
I think I understand the reasoning for wanting to save a set of the scripts that were used in that particular run, but it also seems a little funny to not be using the scripts that you have loaded in RStudio!
Perhaps it might make more sense to save the version (commit #?) of the repository but not the entire repository? Can one of you discuss the reasoning here?
Thanks,
Tim
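For the commit-only option raised above, a minimal sketch (assumes git is on the PATH and the working directory is the repo root):
repo_head <- system("git rev-parse HEAD", intern = TRUE)  # current commit hash
writeLines(repo_head, "scripts_commit.txt")  # store alongside the run outputs instead of a full repo copy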
Should we change how we generate and store background points/reaches tables, given we're using ranges to define model domains? A couple of questions related to this:
Is it necessary (or just a good idea) to use the same background points for every species? Alternatively, we could generate them on the fly in step 1. This would allow us to vary the density/number of background points by species, if desired.
We could also attribute background points (extract values from EVs) within the model process. I think this is a more flexible way to go and shouldn't add much extra time, given we're already loading the data to attribute presence points.
Should we alter the exclusion distance for background points? It's only 30 m now; it could be at least increased to the raster resolution.
Based on @emiliehenderson's question about what threshold is used in the validation loops: it is currently the mean of an estimate of the upper-left corner of the ROC curve, so a single value is applied to all LOO runs. I'd like to change that to calculating a value for each loop. Does anyone have a preference on the threshold metric? How about maxSSS (https://onlinelibrary.wiley.com/doi/full/10.1002/ece3.1878)? @ChristopherTracey, ok?
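For reference, a minimal per-loop maxSSS sketch using the ROCR package (preds and labels are hypothetical stand-ins for one loop's predicted probabilities and 0/1 truth):
library(ROCR)
pred <- prediction(preds, labels)
perf <- performance(pred, "sens", "spec")        # sensitivity (y) vs. specificity (x) across cutoffs
ss <- perf@y.values[[1]] + perf@x.values[[1]]    # sensitivity + specificity at each cutoff
thresh <- perf@alpha.values[[1]][which.max(ss)]  # the maxSSS threshold for this loop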
@ChristopherTracey any thoughts how we'll use (or if we'll need to use) the Data Sources table (lkpDataSources) and the map to species table (mapDataSourcesToSpp) in the tracking DB in this project? If you count the mjd as one source, we have only a few sources, but it will vary by species. How might we get these tables filled? Does the mjd count as one source?
On the terrestrial side, already complete for aquatic.
If you run script 3 in debugging mode, the info at the bottom (such as start_time) doesn't exist?
As we're working to move the lotic EnvVars to an SQLite version instead of a .csv file in order to increase model speed/performance, I have a quick question about its location. Should it be a table in the main SQLite db, or should it be in its own separate SQLite file?
The current aquatic EnvVars table for CONUS is about 16 GB.