
soilreports's Introduction


soilReports

Reports are a handy way to summarize large volumes of data, particularly with figures and tables. soilReports is an R package "container" designed to accommodate the maintenance, documentation, and distribution of R-based reporting tools. Inside the package are report templates, setup files, documentation, and example configuration files.

The soilReports package provides a couple of important helper functions that do most of the work:

  • listReports(): print a listing of the available reports, version numbers, and basic metadata
  • reportSetup(...): download any R packages required by the named report, e.g. "region2/mu-comparison"
  • reportInit(...) | reportCopy(...): copy a named report template into a specific directory
  • reportUpdate(...): update a named report in a specific directory, replacing report.Rmd only

Each report contains several files:

  • report.Rmd: an R Markdown file that is "knit" into a final HTML or DOC report
  • README.md: report-specific instructions
  • custom.R: report-specific functions
  • categorical_definitions.R: report-specific color mapping and metadata for categorical raster data (user-editable)
  • config.R: configuration file to set report parameters (user-editable)
  • changes.txt: notes on changes and associated version numbers

R Profile Setup

NOTE: The following instructions are rarely, if ever, needed with R 4.2+

On many of our machines, the $HOME directory points to a network share. This can cause all kinds of problems when installing R packages, especially if you connect to the network by VPN. The following code is a one-time solution and will cause R packages to be installed on a local disk by adding an .Rprofile file to your $HOME directory. This file will instruct R to use C:/Users/FirstName.LastName/Documents/R/ for installing R packages. Again, you only have to do this once.

# determine your current $HOME directory
path.expand('~')

# install .Rprofile
source('https://raw.githubusercontent.com/ncss-tech/soilReports/master/R/installRprofile.R')
installRprofile(overwrite=TRUE)

soilReports Installation - First time or after R upgrade

Run this code if you don't yet have the soilReports package or after a new version of R has been installed on your machine.

# need the remotes package to install packages from GitHub
install.packages('remotes', dep = TRUE)

# get the latest version of the 'soilReports' package
remotes::install_github("ncss-tech/soilReports", dependencies = FALSE, upgrade_dependencies = FALSE) 

Choose an Available Report

Example Output

Reports for Raster Summary by MU or MLRA

Reports for DMU QC/QA

Reports for Pedon Data

Run a Report - Example: Map Unit Comparison report

# load this library
library(soilReports)

# list reports in the package
listReports()

# install required packages for a named report
reportSetup(reportName='region2/mu-comparison')

# copy report file 'MU-comparison' to your current working directory
reportInit(reportName='region2/mu-comparison', outputDir='MU-comparison')

Updating Existing Reports - Example: Map Unit Comparison report

Updates to report templates, documentation, and custom functions are available after installing the latest soilReports package from GitHub. Use the following examples to update an existing copy of the "region2/mu-comparison" report. Note that your existing configuration files will not be modified.

# get latest version of package + report templates
remotes::install_github("ncss-tech/soilReports", dependencies=FALSE, upgrade_dependencies=FALSE)

# load this library
library(soilReports)

# get any new packages that may be required by the latest version
reportSetup(reportName='region2/mu-comparison')

# overwrite report files in an existing report instance (does NOT overwrite config)
reportUpdate(reportName='region2/mu-comparison', outputDir='MU-comparison')

Suggested Background Material

Troubleshooting

  • If you haven't run R in a while, consider updating all packages with: update.packages(ask=FALSE, checkBuilt=TRUE).
  • Make sure that all raster data sources are GDAL-compatible formats: GeoTIFF, ERDAS IMG, ArcGRID, etc. (not ESRI FGDB)
  • Make sure that the map unit polygon data source is an OGR-compatible format: ESRI SHP, ESRI FGDB, etc.
  • Make sure that the extent of raster data includes the full extent of map unit polygon data.
  • If there is a problem installing packages with reportSetup(), consider adding the upgrade=TRUE argument.
  • If you are encountering errors with "Knit HTML" in RStudio, try: update.packages(ask=FALSE, checkBuilt=TRUE).

TODO

See issue tracker for TODO items.

Related Packages

soilreports's People

Contributors

alenars, brownag, dylanbeaudette, hammerly, jennifer-wood, kant, smroecker


soilreports's Issues

explore other ordination methods

  1. tried other MDS methods:
    • MASS::sammon() fails when there are duplicates in the initial configuration
    • vegan::monoMDS() gives similar results to MASS::isoMDS()
    • MASS::isoMDS() is the fastest, most stable algorithm tried so far (see the sketch below)
    • tried the t-SNE algorithm (https://lvdmaaten.github.io/tsne/); it is slow and runs out of memory on large datasets
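A minimal sketch of the preferred approach, MASS::isoMDS() on a distance matrix; mtcars stands in for whatever dissimilarity matrix the report actually computes:

library(MASS)

# d: dissimilarity matrix of class 'dist' (stand-in for the report's distance metric)
d <- dist(scale(mtcars))

# non-metric MDS in 2 dimensions; fails if d contains zero distances (duplicates)
mds <- isoMDS(d, k = 2)

# ordination coordinates
plot(mds$points, xlab = 'MDS 1', ylab = 'MDS 2')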

bootstrap package installation when .Rprofile isn't in place

There is no clean way to get devtools and soilReports into the "correct" library paths when:

  1. an .Rprofile file is missing
  2. HOME is set to H:/

Possible solution: source installRprofile() from GH and run before installing anything.
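A minimal sketch of that bootstrap, run in a fresh R session before anything else is installed (uses the installRprofile.R script referenced in the README above):

# 1. install an .Rprofile that points the package library at a local disk
source('https://raw.githubusercontent.com/ncss-tech/soilReports/master/R/installRprofile.R')
installRprofile(overwrite = TRUE)

# 2. restart R so the new .Rprofile takes effect, then install as usual
install.packages('remotes', dep = TRUE)
remotes::install_github('ncss-tech/soilReports', dependencies = FALSE, upgrade = FALSE)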

TODO: test!

add slope class breakdown

  1. slope class definitions in config.R
  2. classify the slope map (if there is one) and present the result as proportions (see the sketch below)
  3. sanity checks
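A hedged sketch of steps 1 and 2; the slope.classes object in config.R and the sample vector are hypothetical:

# config.R (hypothetical): slope class breaks (%) and labels
slope.classes <- list(
  breaks = c(0, 3, 8, 15, 30, 50, 75, Inf),
  labels = c('0-3', '3-8', '8-15', '15-30', '30-50', '50-75', '75+')
)

# s.slope: slope samples (%) extracted from the slope map within a map unit
s.slope <- runif(1000, min = 0, max = 90)  # stand-in for real samples

# classify and present as proportions
slope.class <- cut(s.slope, breaks = slope.classes$breaks,
                   labels = slope.classes$labels, right = FALSE)
prop.table(table(slope.class))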

Testing of new effective sampling size calculations and box-whisker plots

New process:

  1. load rasters into memory if possible
  2. perform cursory grid-based sampling to determine Moran's I of each raster
  3. set sampling intensity based on I of each raster

This is all implemented in:

  • sharpshootR::sampleRasterStackByMU(..., estimateEffectiveSampleSize=TRUE)
  • sharpshootR::Moran_I_ByRaster()
  • sharpshootR::ESS_by_Moran_I()

These functions have not been extensively documented or tested.
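A minimal usage sketch, mirroring the call made by the map unit comparison report; mu, mu.set, mu.col, raster.list, and pts.per.acre come from the report's config.R:

library(sharpshootR)

# sample the raster stack within map unit polygons, estimating the
# effective sample size of each raster via Moran's I
sampling.res <- sampleRasterStackByMU(
  mu, mu.set, mu.col, raster.list, pts.per.acre,
  estimateEffectiveSampleSize = TRUE
)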

Interactive (Shiny) reports - pedon summary report example

In the 'region2' folder under branch 'AGB' you will find a skeleton of a pedon summary report that employs Shiny and flexdashboard for user interaction/visualization of pedon data. A static report can be knitted after tweaking parameters of interest.

There are a few different moving parts here that extend beyond the typical case we have for soilReports.

The main workhorse is shiny.Rmd. This is the file that would be run from within RStudio.

shiny.Rmd sources config.R, utility_functions.R, and main.R.

Upon running shiny.Rmd, an interface showing data from the user's NASIS selected set will appear. config.R provides a few settings (linework data source, caching, generalized horizons, and loading of raster data sources) that are prerequisites for creating the Shiny interface and its contents.

From the Shiny interface, the user can use regular expressions to filter pedon data by MUSYM, taxon name, or pedon ID, or can explicitly specify a list of pedons. They can also select a 'modal' pedon from their subset for comparison.

The tabs to the right of the input panel correspond to sections of the pedon summary report with some new additions that display modal pedon data, horizon generalization, etc.

Once a user is happy with the selected set of pedons and is ready to create a permanent record of their work, they hit the "Export" button, which knits a static HTML file containing the same information displayed in the Shiny interface. The R objects built in the Shiny interface are passed to the knit session via an R environment containing the dependencies needed for the template report.Rmd to run. "Export" also writes out copies of the R objects (associated component table, soil profile collections) and tabular output that uniquely identifies the pedons used.

The issue here is: how can this style of report be integrated into soilReports?

Let me know what you think if you get a chance to try this out. You'll need pedons and some DMU components in your selected set; update the config to point to your linework, and once the dashboard loads, adjust the default MUSYM pattern for parsing the NASIS selected set.

add option to cache samples

Caching samples between report runs would speed report development and tweaking of parameters. However, it could cause confusion when a report is re-run after changes have been made to the linework.

Consider setting this flag to FALSE as a default. Also, this may be related to debugging output such as chunk timing.
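A hedged sketch of the pattern, defaulting to FALSE as suggested; the cache.samples flag and cache file name are hypothetical:

# config.R (hypothetical flag)
cache.samples <- FALSE

# report.Rmd: re-use cached samples only when explicitly enabled
cache.file <- 'cached-samples.rda'
if (cache.samples && file.exists(cache.file)) {
  # note: a stale cache will not reflect edits to the linework
  load(cache.file)
} else {
  sampling.res <- sampleRasterStackByMU(mu, mu.set, mu.col, raster.list, pts.per.acre)
  save(sampling.res, file = cache.file)
}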

estimate effective DF from spatial samples

Graphical and formal comparisons are difficult due to the VERY high spatial autocorrelation. How can we down-weight the DF accordingly?

Ideas

1. http://www.inside-r.org/packages/cran/SpatialPack/docs/modified.ttest
2. use `clhs` output for bwplots and distributional tests... realistic?
3. weighting via local Moran's I
4. https://github.com/jebyrnes/spatial_correction_lavaan
5. [comparison of methods](http://www.petrkeil.com/?p=1050)
6. [faster Moran's I calculation via ape::Moran.I and a distance matrix](http://www.ats.ucla.edu/stat/r/faq/morans_i.htm)

Once we figure out an accurate and scalable approach, we still need to figure out how to get these adjusted sample sizes into custom.bwplot() via bwplot(). This may require a custom panel function that performs a look-up against the spatial stats summary table.
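A minimal sketch of ideas 3 and 6, computing Moran's I with ape::Moran.I() and inverse-distance weights; s.xy and s.value are stand-ins for sample coordinates and raster values:

library(ape)

# stand-ins: sample coordinates and raster values at those samples
s.xy <- cbind(runif(100), runif(100))
s.value <- rnorm(100)

# inverse-distance spatial weights, zero on the diagonal
w <- 1 / as.matrix(dist(s.xy))
diag(w) <- 0

# Moran's I: high values imply strong autocorrelation, hence a smaller effective n
Moran.I(s.value, w)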

report configuration

Ideally, most reports should be configurable via:

  • config.R
  • interactive report parameters
  • parameters specified in a call to render() or in the report YAML header

There are cases where each has benefits. Converting the "region 2" reports to use rmarkdown report parameter names shouldn't take all that long.
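A minimal sketch of the third option, using standard R Markdown parameters (parameter names are hypothetical):

# report.Rmd YAML header (hypothetical parameter names):
# ---
# params:
#   mu.col: MUSYM
#   pts.per.acre: 1
# ---

# override parameters in a call to render()
rmarkdown::render('report.Rmd', params = list(mu.col = 'MUSYM', pts.per.acre = 3))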

More expressive raster list in config.R

Start with something like this:

raster.list <- list(
  continuous=list(
    `Mean Annual Air Temperature (degrees C)`='E:/gis_data/prism/final_MAAT_800m.tif',
    `Mean Annual Precipitation (mm)`='E:/gis_data/prism/final_MAP_mm_800m.tif',
    `Effective Precipitation (mm)`='E:/gis_data/prism/effective_precipitation_800m.tif',
    `Frost-Free Days`='E:/gis_data/prism/ffd_mean_800m.tif',
    `Growing Degree Days (degrees C)`='E:/gis_data/prism/gdd_mean_800m.tif',
    `Elevation (m)`='E:/gis_data/region-2-mu-analysis/elev_30.tif',
    `Slope Gradient (%)`='E:/gis_data/region-2-mu-analysis/slope_30.tif',
    `Annual Beam Radiance (MJ/sq.m)`='E:/gis_data/ca630/beam_rad_sum_mj_30m.tif',
    `(Estimated) MAST (degrees C)`='E:/gis_data/ca630/mast-model.tif',
    `Compound Topographic Index`='E:/gis_data/ca630/tci30.tif',
    `MRVBF`='E:/gis_data/ca630/mrvbf_10.tif',
    `SAGA TWI`='E:/gis_data/ca630/saga_twi_10.tif'
  ),
  categorical=list(
    `Geomorphon Landforms`='L:/Geodata/DEM_derived/forms10.tif',
    `Curvature Classes`='E:/gis_data/ca630/curvature_classes_15.tif'
  ),
  circular=list(`Slope Aspect (degrees)`='E:/gis_data/region-2-mu-analysis/aspect_30.tif')
)

"to check" polygon output

Add an additional SHP file to the output that includes information on polygons with an above-threshold proportion of sample values outside the 5th-95th percentile range for the MU.

Currently, the statistics shapefile output contains median values for each raster and the "toCheck" flag, a ranking based on the number of samples outside the range. This layer is useful for symbolizing where problematic polygons occur, and also the distribution of abiotic factors (vis-à-vis the median) across a MU extent, but it does not indicate why a particular polygon is flagged. This can rarely be determined from the median values unless it is a very extreme case.

The new shapefile will follow the same format as the stats shapefile (1 column per raster), but instead of medians it will contain the "proportion of samples outside range" for each raster data source.

This will reduce the iterative process of looking up polygon IDs to see why they were flagged. Currently that information is only available in the tabular output at the end of the report HTML file.

In addition to reducing iteration between the report and the shapefile display in e.g. ArcMap, this will allow symbolizing MUs based on the proportion outside range for INDIVIDUAL data sources rather than an aggregate of all data sources.

add KSSL depth-slice summaries to MLRA report


This would require a pre-made slab-style database of aggregate data for all MLRA. New configuration options in config.R could allow for specification of 4-5 soil properties from 10-20 possible properties.

Sampling / Aggregating:

# slab() is provided by aqp
library(aqp)
library(soilDB)

# iterate over MLRA in current shapefile or local list of codes
x <- fetchKSSL(mlra = '136')

# aggregate; property selection is important
a <- slab(x, mlra ~ clay + estimated_ph_h2o + caco3 + bs82)

# save to file (illustrative file name)
saveRDS(a, file = 'slab-mlra-136.rds')

Combine into database:

# assemble samples into single file and distribute
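A hedged sketch of this step, assuming one .rds file per MLRA as saved above:

# assemble per-MLRA slab output into a single data.frame and distribute
rds.files <- list.files(pattern = 'slab-mlra-.*\\.rds$')
slab.all <- do.call(rbind, lapply(rds.files, readRDS))
saveRDS(slab.all, file = 'slab-all-mlra.rds')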

Symbolize in report

# panel.depth_function() / prepanel.depth_function() are provided by aqp
library(aqp)
library(RColorBrewer)
library(lattice)

# adjust factor labels for MLRA to include number of pedons
pedons.per.mlra <- tapply(site(x)$mlra, site(x)$mlra, length)
a$mlra <- factor(a$mlra, levels=names(pedons.per.mlra), labels=paste(names(pedons.per.mlra), ' (', pedons.per.mlra, ' profiles)', sep=''))

# re-name variables
a$variable <- factor(a$variable, labels=c('Clay %', 'pH 1:1 Water', 'CaCO3 Equiv. (%)', 'Base Sat. pH 8.2'))

# make some nice colors
cols <- brewer.pal(n = 7, name = 'Set1')

xyplot(
  top ~ p.q50 | variable, groups=mlra, data=a, lower=a$p.q25, upper=a$p.q75, 
  ylim=c(170,-5), alpha=0.25, scales=list(y=list(tick.num=7, alternating=3), x=list(relation='free',alternating=1)),
  panel=panel.depth_function, prepanel=prepanel.depth_function, sync.colors=TRUE, asp=1.5,
  ylab='Depth (cm)', xlab='median bounded by 25th and 75th percentiles', strip=strip.custom(bg=grey(0.85)),
  par.settings=list(superpose.line=list(col=cols, lty=c(1,2,3), lwd=2)),
  auto.key=list(columns=3, title='MLRA', points=FALSE, lines=TRUE),
  sub=paste(length(x), 'profiles')
)

abbreviateNames() does not always return unique names

This is typically only a problem when someone has created their input features (mu) with many tables joined to them. Some options:

  • filter out all columns except mu.col (see the sketch after the code excerpt below)
  • smarter abbreviateNames()

The context of this error is:

# dcast() is provided by reshape2
poly.check.wide <- dcast(polygons.to.check, pID ~ variable, value.var = 'prop.outside.range')
# replace NAs with zero (no samples outside the 5th-95th percentile range)
poly.check.wide[is.na(poly.check.wide)] <- 0

mu.check <- merge(mu, poly.check.wide, by='pID', all.x=TRUE)
names(mu.check)[-1] <- abbreviateNames(mu.check)

# fix names for printing
names(polygons.to.check)[1] <- mu.col

# print table (removed from report now that shapefile with proportions outside range is generated)
#kable(polygons.to.check[display.idx,], row.names = FALSE) #only shows polys with p.crit > 0.15 in report tabular output

#save a SHP file with prop.outside.range for each polygon and raster data source combination
if(nrow(polygons.to.check) > 0) {
  shp.fname <- paste0('poly-qc-', paste(mu.set, collapse='_'))
  writeOGR(mu.check, dsn='output', layer=shp.fname, driver='ESRI Shapefile', overwrite_layer=TRUE)
  write.csv(mu.check,file=paste0("output\\",shp.fname,".csv")) 
}
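A minimal sketch of the first option, dropping everything except the polygon ID and map unit symbol before the merge; assumes mu already carries a pID column:

# keep only the ID and map unit symbol columns; all other attributes
# came from joined tables and only inflate the field names
mu <- mu[, c('pID', mu.col)]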

keep track of changes

A NEWS file, or some other record of changes per report. Perhaps a new list or vector in the report-metadata chunk (see the sketch below).
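One possible form, a named vector in the report-metadata chunk (variable name and entries are hypothetical):

# report-metadata chunk (hypothetical)
.report.changes <- c(
  '0.2' = 'short description of change',
  '0.1' = 'initial version'
)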

Extending reportInit/reportSetup to handle a "report manifest"

Currently, we assume that every report consists of report.Rmd, config.R, and setup.R files, but sometimes these are not the only documents that need to be copied by reportInit().

Examples of this case are:

  1. region11 lab summary reports (using custom.R)
  2. Shiny interactive pedon summary report (currently under region2 in branch 'AGB')

A few questions for your consideration:

  • Should setup.R be extended to include a variable (e.g. files_to_copy; vector of directories and files to copy) that lists additional items to be copied when creating a new report instance? (see the sketch below)
  • Should we distinguish reports that have an interactive/dashboard component (i.e. 'has_dashboard' boolean that indicates presence of shiny.Rmd or equivalent)?
  • Should the report metadata (currently stored at top of report.Rmd) be moved to a set of metadata variables in setup.R?
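A sketch of the first two ideas as additions to setup.R, using the variable names proposed above:

# setup.R (proposed additions)

# additional files/directories copied by reportInit()
files_to_copy <- c('custom.R', 'utility_functions.R', 'shiny.Rmd')

# flag reports with an interactive/dashboard component
has_dashboard <- TRUE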

Demonstration of sampling intensity

This is related to #11.

Some Region 2 folks are not convinced that the sampling approach is appropriate: they want to use all pixels within a suite of polygons.

Demonstrate appropriate sampling density via Moran's I and stability of the median as a function of sample size.
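A minimal sketch of the median-stability half of that demonstration; the sample raster ships with the raster package, so this runs as-is:

library(raster)

# stand-in for one of the report's raster data sources
r <- raster(system.file('external/test.grd', package = 'raster'))

# estimate the median at increasing sample sizes
n.seq <- c(10, 50, 100, 500, 1000, 2000)
med <- sapply(n.seq, function(n) median(sampleRandom(r, size = n)))

# stability of the median as a function of sample size
plot(n.seq, med, type = 'b', log = 'x',
     xlab = 'sample size', ylab = 'estimated median')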

specify categorical raster details in definitions file

Building off of #28...

categorical = list(
  `Geomorphon Landforms` = list(
    data = 'L:/Geodata/DEM_derived/forms10.tif',
    legend = list(
      levels = 1:10,
      labels = c('flat', 'summit', 'ridge', 'shoulder', 'spur', 'slope', 'hollow', 'footslope', 'valley', 'depression')
    )
  ),
  ...
)

Error: Failure during raster IO

This is likely associated with sharpshootR::sampleRasterStackByMU(), and lower-level raster access as specified in the traceback:

rgdal::getRasterData(con, offset = offs, region.dim = c(1, nc), band = layers)
13 .readCellsGDAL(x, uniquecells, layers)
12 .readCells(x, cells, 1)
11 .cellValues(object, cells, layer = layer, nl = nl)
10 .xyValues(x, coordinates(y), ..., df = df)
9 .local(x, y, ...)
8 raster::extract(r, s)
7 raster::extract(r, s)
6 data.frame(value = raster::extract(r, s), pID = s$pID, sid = s$sid)
5 (function (r) {
    res <- data.frame(value = raster::extract(r, s), pID = s$pID, sid = s$sid) ...
4 rapply(raster.list, how = "replace", f = function(r) {
    res <- data.frame(value = raster::extract(r, s), pID = s$pID, sid = s$sid)
    return(res) ...
3 sampleRasterStackByMU(mu, mu.set, mu.col, raster.list, pts.per.acre, estimateEffectiveSampleSize = correct.sample.size)
2 withCallingHandlers(expr, warning = function(w) invokeRestart("muffleWarning"))
1 suppressWarnings(sampleRasterStackByMU(mu, mu.set, mu.col, raster.list, pts.per.acre, estimateEffectiveSampleSize = correct.sample.siz

Some ideas:

link points in ordination to points on the ground

The ordination figure would be a lot more useful if there were some way to link points in the figure with points on the ground. This could be as simple as preserving coordinates / IDs and outputting sampling locations (cLHS sub-sample). Some degree of interaction is needed when plots are cluttered:

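A minimal sketch of the simple version, writing out the cLHS sub-sample locations and IDs used in the ordination; assumes s.sub is a SpatialPointsDataFrame of the sub-sampled points carrying a pID column:

library(rgdal)

# write ordination sample locations so they can be inspected in a GIS
writeOGR(s.sub, dsn = 'output', layer = 'ordination-samples',
         driver = 'ESRI Shapefile', overwrite_layer = TRUE)

# plain table of polygon IDs + coordinates for linking figure and ground
write.csv(data.frame(pID = s.sub$pID, coordinates(s.sub)),
          file = 'output/ordination-samples.csv', row.names = FALSE)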
