
fldgen's Introduction

fldgen 2.0: Climate variable field generator with internal variability and spatial, temporal, and inter-variable correlation.


The fldgen package provides functions to learn the spatial, temporal, and inter-variable correlation of the variability in an earth system model (ESM) and generate random two-variable fields (e.g., temperature and precipitation) with equivalent properties.

Installation

The easiest way to install fldgen is using install_github from the devtools package.

install_github('JGCRI/fldgen', build_vignettes=TRUE)

This will get you the latest stable development version of the model. If you wish to install a specific release, you can do so by specifying the desired release as the ref argument to this function call.
Current and past releases are listed in the release table at our GitHub repository.
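For example, to install a tagged release (the tag name below is illustrative; check the release table for the actual tags):

install_github('JGCRI/fldgen', build_vignettes=TRUE, ref='v2.0.0')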

Example

The data used in this example are installed with the package. They can be found in system.file('extdata', package='fldgen').

library(fldgen)
datadir <- system.file('extdata', package='fldgen')


infileT <- file.path(datadir, 'tas_annual_esm_rcp_r2i1p1_2006-2100.nc')
infileP <- file.path(datadir, 'pr_annual_esm_rcp_r2i1p1_2006-2100.nc')
emulator <- trainTP(c(infileT, infileP),
                    tvarname = "tas", tlatvar='lat', tlonvar='lon',
                    tvarconvert_fcn = NULL,
                    pvarname = "pr", platvar='lat', plonvar='lon',
                    pvarconvert_fcn = log)


residgrids <- generate.TP.resids(emulator, ngen = 3)
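
## tgav is a time series of global mean temperatures used below to build the
## mean field; it is not created by the calls above. One natural choice (an
## assumption on our part -- the emulator appears to store the training
## data's global mean series as emulator$tgav) is:
tgav <- emulator$tgav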

fullgrids <- generate.TP.fullgrids(emulator, residgrids,
                                   tgav = tgav,
                                   tvarunconvert_fcn = NULL,
                                   pvarunconvert_fcn = exp,
                                   reconstruction_function = pscl_apply)

A more detailed example can be found in the tutorial vignette included with the package.

vignette('tutorial2','fldgen')

fldgen's People

Contributors

abigailsnyder, kdorheim, rplzzz

fldgen's Issues

Create and archive trained emulators for popular ESMs

The hardest part of training an emulator is getting hold of the ESM data in the first place. We should train joint temperature-precipitation emulators for the better-known models in the ensemble. We can archive those and provide functions to download them from the archive, saving most users the hassle of wrangling the raw ESM data.
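A hypothetical download helper along those lines (the archive URL, file naming, and serialization format are all placeholders, not an existing API):

fetch_emulator <- function(esm, dest = tempfile(fileext = ".rds")) {
    ## Placeholder archive location; a real version would point at, e.g.,
    ## a Zenodo record.
    url <- paste0("https://example.org/fldgen-emulators/", esm, "_TP.rds")
    download.file(url, dest, mode = "wb")
    readRDS(dest)
}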

Quantile transform sometimes generates NA values

The problem seems to be with the interpolation function used to approximate the quantile function:

inv.ecdf <- function(column.resids) {
    ## For one column of the residuals, get the rank order of the points.
    order.vector <- rank(column.resids)
    ## Cumulative probabilities are (order - 1) / (Ntime + 1);
    ## `offset` (= 1 / (Ntime + 1)) comes from the enclosing scope.
    probabilities <- (order.vector - 1) * offset
    ## Return an interpolator for the inverse ECDF (the quantile function).
    approxfun(x = probabilities, y = column.resids)
}

By default, approxfun produces NA values for inputs outside the range of x-values used to construct the approximation. In theory, this should never happen, but due to roundoff error the largest or smallest value in the empirical distribution can sometimes appear to be out of range. The easiest fix will be to specify rule=2 in the call to approxfun, which will cause these values to be clamped to the range of the x-values used to construct the approximator, which is what we want anyhow.
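A minimal sketch of that fix (the same function with rule = 2 added; `offset` as above):

inv.ecdf <- function(column.resids) {
    order.vector <- rank(column.resids)
    probabilities <- (order.vector - 1) * offset
    ## rule = 2 clamps out-of-range inputs to the nearest endpoint instead
    ## of returning NA, which is the behavior we want here.
    approxfun(x = probabilities, y = column.resids, rule = 2)
}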

Instructions in tutorial2 are incorrect

In tutorial2 we have the following instructions for training a model in a single step

infileT <- file.path(datadir, 'tas_annual_esm_rcp_r2i1p1_2006-2100.nc')
infileP <- file.path(datadir, 'pr_annual_esm_rcp_r2i1p1_2006-2100.nc')
emulator <- trainTP(c(infileT, infileP), 
                    tvarname = "tas", tlatvar='lat', tlonvar='lon',
                    tvarconvert_fcn = NULL,
                    pvarname = "pr", platvar='lat', plonvar='lon',
                    pvarconvert_fcn = log)

However, in these input files, the names of the latitude and longitude coordinates are actually lat_2 and lon_2 (there are lat and lon variables, but they are length 1 and contain just the single value 0 -- I have no idea why it was generated that way).

Because this code isn't run, the package checks didn't catch the error. If you look further down where we do run the code, step by step, we use the correct versions.

Correction: It's only the tas input that has the funky coordinate names; the pr input has regular names.
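For reference, a corrected call under that reading (lat_2/lon_2 for the tas file only) would look like:

emulator <- trainTP(c(infileT, infileP),
                    tvarname = "tas", tlatvar = 'lat_2', tlonvar = 'lon_2',
                    tvarconvert_fcn = NULL,
                    pvarname = "pr", platvar = 'lat', plonvar = 'lon',
                    pvarconvert_fcn = log)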

Add tutorials to actually go through emulator validation

Currently, detailed emulator validation lives in offline R Markdown documents that Abigail just...emails to different people. That's not robust. Incorporate them as additional tutorials in the package:

  • store the netCDF data on Zenodo so that R package utilities can pull it, and an example tutorial emulator can be trained on multiple realizations without including the data in the package.
  • train
  • validate trained EOFs
  • generate realizations
  • validate generated realizations

These tutorials could then be copied and adjusted by users for other emulators. (Attachment: fldgen-analysis.zip)

Update package man page in fldgen.R

We should update the examples section to include an example using the new two-variable functionality.

(This text is duplicated in README.md, so we should update it there too.)

Prove property 1A

Prove that the variance of a grid cell in generated data is equal to the variance of the same cell in the input data.

proofs about means

  1. mu = 0.
  2. mu_i = 0 for all i
  3. (mu_i)^2 = (mu*_i)^2 for all i
  4. mu*_i = 0 [obvious from 2 and 3]
  5. mu* = 0 [obvious from 4]
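
Restating compactly (our reading of the notation, which the issue leaves implicit: unstarred quantities are means over the input data, starred quantities are means over the generated data; $\mu$ is the global mean and $\mu_i$ the mean of grid cell $i$):

$$\mu = 0,\qquad \mu_i = 0 \;\forall i,\qquad \mu_i^2 = (\mu_i^*)^2 \;\forall i \;\Longrightarrow\; \mu_i^* = 0 \;\forall i \;\Longrightarrow\; \mu^* = 0.$$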

Problem with testing fldgen on developmental version of R

We were originally using GitHub Actions to test fldgen across platforms and on two versions of R: the current release (3.6) and the development version. But our dependency on ncdf4 breaks the package on dev R. As of 2020-04-30 (see 00239e0) we've removed this test, but ensuring that fldgen runs on newer versions of R will be an issue to address in the future.

globalop field in griddata is incorrect when grid cells are missing

When part of a grid is missing, as, for example, in the land-only data files provided by ISIMIP, we drop all of the grid cells containing NA values:

fldgen/R/handle_NAs.R, lines 50 to 55 at c8e220d:

## Find the grid cells with valid data
valid_cells <- which(nmiss == 0)
## Drop missing cells from vardata and globalop
griddata$vardata <- griddata$vardata[ , valid_cells]
griddata$globalop <- griddata$globalop[valid_cells, ]

The globalop field is the global mean operator; we have to drop the missing values from there too. However, for the global mean operator to do its job, it has to be scaled so that it sums up to 1, and we don't rescale it here. This causes our global mean temperatures to be bogus.

> emu <- train_models('IPSL-CM5A-LR')
> sum(emu$griddataT$globalop)
[1] 0.2869764
> summary(emu$tgav)
       V1       
 Min.   :81.73  
 1st Qu.:82.01  
 Median :82.16  
 Mean   :82.39  
 3rd Qu.:82.73  
 Max.   :84.28  
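
The fix stated above, as a minimal sketch (renormalizing immediately after the missing cells are dropped):

## Drop missing cells from vardata and globalop
griddata$vardata <- griddata$vardata[ , valid_cells]
griddata$globalop <- griddata$globalop[valid_cells, ]
## Rescale the global mean operator so its weights again sum to 1.
griddata$globalop <- griddata$globalop / sum(griddata$globalop)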

update stickers on the readme

No rush. Right now the README includes badges for Travis and AppVeyor; we should update it at some point to reflect that we are now using GitHub Actions.

Misleading documentation in generate.TP.fullgrids

#' This function takes in a trained emulator - a structure of class
#' \code{fldgen}.
#' This structure contains everything the emulator has learned about the model,
#' and is used to generate new fields of residuals. Also taking in a list of
#' generated residual fields and a global average yield, a global gridded
#' mean field is constructed according to the input reconstruction_function.
#' The mean field and residual fields are added to return a list of different
#' realizations of full fields.

This passage suggests that in this function we will be generating new residuals, but in fact the residuals have to be generated outside of the function and passed in. Also, the text could be clarified as to how the probability distribution transform is applied, and we should drop the reference to "yield", which is confusing.

Clean up figures

Get figures publication-ready. Clean up things like legends, fonts, line weights, etc.

Handle multiple input time series

Update the code to handle multiple input time series. For the mean field model and the EOF decomposition, this is trivial; both of those algorithms operate on the rows of the input data independently, so you can just concatenate the data sets and pass them to the relevant functions.

For the power spectrum analysis, you have to do each Fourier transform individually and average the magnitudes of the power spectra found from each time series. That introduces some minor complications because you need to keep a record of which rows in the grand data matrix correspond to which input time series. It’s important to do the bookkeeping in a way that doesn’t result in a lot of data copying.
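
A sketch of the averaging step, assuming each input time series has already been reduced to an ntime x ngrid residual matrix of equal length (the row-bookkeeping machinery is omitted; the function name is ours):

avg_power_spectrum <- function(serieslist) {
    speclist <- lapply(serieslist, function(resids) {
        ## Fourier transform each grid cell's time series individually and
        ## take the squared magnitude as its power spectrum.
        apply(resids, 2, function(x) Mod(fft(x))^2)
    })
    ## Average the spectra over the input time series.
    Reduce(`+`, speclist) / length(speclist)
}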

'reading and writing models works' test fails

The 'reading and writing models works' test in test_writedata.R fails. It contains the following comments:

## These tests are kind of long, since the fldgen objects are rather large.
## Therefore, skip them on the CI environments, which are somewhat
## time-sensitive.  And since we're skipping them on the CI, I also haven't
## included the data file in the repository (it is quite large for the time
## being).  We can revisit this once we find a way to shrink the memory
## footprint of the fldgen objects.

fully define and use classes for composite structures

e.g. #14

rplzzz (Member):
This is a case where we could really benefit from defining classes for all of our composite structures. It would make describing what the functions are expecting easier, and we could even check the class attributes in functions that are expecting structures.

fix the memory bloat of a saved emulator

Option 1:
Use something other than approxfun to characterize the empirical CDF for each grid cell using fewer stored values, in R/normalizeresiduals.R (#14).

Option 2:
If Option 1 doesn't bring down the size of a saved emulator by enough, cry and think of something else

revisit warnings - variable labels - units

  1. We add a lot of warning messages that only the package developers really understand, especially around converting units from P to logP and back again while leaving T alone (e.g., it prints a warning message that T didn't change).

A lot of this was because we wrote the code so that it didn't matter whether variable 1 was temperature, or precip, or something else.

If we make firmer assumptions about what each variable is, we can create variable specific warning messages/print statements that are more useful.

  2. Maybe add more explicit unit checks to the read functions (e.g., check whether T units are in K or C); see the sketch below.
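
A hypothetical check of the kind item 2 describes (the accepted unit strings follow the CF-style spellings usually seen in these netCDF files; the function name is ours):

check_temp_units <- function(units) {
    ## Accept Kelvin or Celsius spellings commonly seen in CMIP-style files.
    if (!units %in% c("K", "degC", "C")) {
        warning("unrecognized temperature units: ", units)
    }
    invisible(units)
}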

Test installing from github

Make sure that the fldgen v2 release candidate installs cleanly with install_github. I always install from a local repository, so I don't think I've ever tried it.

Add CITATION file

Add a CITATION file to provide instructions for citing the package. There's an example in the food demand package: https://github.com/JGCRI/food-demand/blob/rpkg/inst/CITATION

One tricky thing here is that we would like to have version 2 reference both the original v1 paper and the v2 software paper. However, we will need to have a finalized v2 release to submit the paper. I think we can tiptoe around this issue by designating the release that we make when the paper goes out as a pre-release (and calling it something like 2.0.0-beta) and then finalizing it after we have some bibliographic information. I'll look into what options we have.

planned potential changes to get to v2.1

Sometime this summer depending on staffing.

  1. Make the 1var training function capable of handling training data with NA values. Create functions for 1var residual generating and full field generating. This would just be taking all the 2var stuff I’ve written and adapting it.
  2. Make a tutorial0 focused on the one step training and generating functions: one variable, two variable, data with NA values, and using a different tgav than that calculated from the data.
  3. Train TP emulators for a handful of major ESMs and put them on Zenodo. Then we can include the link to that in the github repo.
  4. Make the names of the functions in fldgen more generic: train2var instead of trainTP, using things like var1, var2 instead of T and P. The actual science is all the same - our 2 var functions should work for any 2 related variables, not just TP. So we could make the names reflect that.

remove dependency on gcammaptools

This consistently causes issues in installations and testing. As far as I can tell, we only use a call to gcammaptools in the plot_field function in fldgen/R/plot.R and then in calls to plot_field from the vignettes. This is a handy convenience function that makes pretty maps, but in no way core to the science we're trying to do.

  • @kdorheim track down and comment out calls to plot_field, and comment out most of the interior of plot_field as well. Hopefully there isn't too much to track down; that then lets us move on to the next issue with setting up testing.

  • @abigailsnyder implement a different interior for plot_field that doesn't require a call to gcammaptools (see the sketch below). This will probably sacrifice having a nice map projection, in the short term at least. Just do quick rectangular map plotting so that we have a working convenience function we can call from the vignettes. We can improve the aesthetics of its maps down the road, or not (a user could obviously make their own map if they really hated ours).
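
A minimal sketch of a gcammaptools-free interior, assuming ggplot2 and that `field` is an nlat x nlon matrix with `lat` and `lon` coordinate vectors (the real plot_field signature may differ):

library(ggplot2)

plot_field_simple <- function(field, lat, lon) {
    ## Quick rectangular lat-lon plot; no map projection.
    df <- expand.grid(lat = lat, lon = lon)
    ## Column-major flattening of an nlat x nlon matrix makes lat vary
    ## fastest, matching expand.grid's ordering above.
    df$value <- as.vector(field)
    ggplot(df, aes(x = lon, y = lat, fill = value)) +
        geom_raster() +
        coord_fixed() +
        labs(fill = "value")
}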

Fix failing tests

Although everything appears to be working properly in practical use, a few of the tests are still failing. Circle back and fix these.

Prove property 1B

Prove that the values in a grid cell of the generated data are (approximately) normally distributed (across time and between realizations).

I believe we can appeal to the CLT for this one. Therefore, the property is only approximately true, but our empirical results seem to indicate that it is more than close enough for our purposes.

Prove property 3

Prove that the spatial correlation in the generated fields is the same as the spatial correlation function for the input data.

We have a little flexibility in how we go about this. Actually computing the spherical ACF is a pain, but we might be able to finesse that because we don't actually need to compute the ACF to prove that the ACFs for the two fields are equal. Alternatively, it might be sufficient to show that the covariance matrices for the two fields are the same.

Add option to splice a historical scenario onto each future scenario

In the past, we've run only future scenarios, but for an upcoming experiment we need to be able to run fldgen over datasets that span the entire historical and future periods. The easiest way to do this is to add an (optional) argument to specify a historical scenario to splice onto the beginning of each future scenario.

Once we've done that, we can treat each spliced scenario almost exactly as if it were provided as a single scenario. One exception is that when doing the linear fits, we might want to arrange for duplicated historical scenarios to be included in the fit only once, so as not to bias the fit. This will require a bit of extra record keeping in the tagout structure, but otherwise should be pretty easy to do.
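
A sketch of the splice itself, assuming scenario data arrive as matrices with one row per year (the helper name is ours):

splice_hist <- function(histdata, futuredata) {
    ## The spliced result is then treated as one scenario spanning both
    ## periods; deduplicating repeated historical years in the linear
    ## fits would be handled separately.
    rbind(histdata, futuredata)
}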

downsized data sets to speed up testing

Currently devtools::test() takes a long time because our included sample data is large. At some point, we could figure out smaller sample data that works for the tests just as well.

Scale data when computing EOFs, or not?

The code currently does not scale the residual data to unit variance, nor, for that matter, does it center it. The reason we chose not to scale is that it slightly complicates the procedure for reconstructing fields from the EOFs. However, conventional wisdom is that scaling is advisable.
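
For concreteness, the two options expressed with a stock PCA call (prcomp here is a stand-in for the package's own EOF code; `resids` is a hypothetical ntime x ngrid residual matrix):

## Current behavior: no centering, no scaling.
eof_unscaled <- prcomp(resids, center = FALSE, scale. = FALSE)
## Conventional alternative: center and scale each grid cell to unit variance.
eof_scaled   <- prcomp(resids, center = TRUE, scale. = TRUE)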

Questions:

  1. Does scaling add value to the process sufficient to justify the extra complexity?
  2. What, if any, harm can not scaling introduce?

Prove property 2

Prove that within a single run the autocorrelation function for a grid cell is equal to that of the corresponding cell in the reference data.

This is guaranteed for the basis functions by the Wiener-Khinchin theorem. We need to show that applying the W-K theorem to each basis function independently produces the appropriate ACF for each grid cell individually.

variability in Tgav and using Hector to drive trained ESM emulators

Documenting points here:

  • fldgen successfully deals with variability about the ESM_Tloc-ESM_Tgav regression.
  • ESM Tgav trajectories have interannual and decadal variability. However, Hector's Tgav trajectories are significantly smoother, even when calibrated to a specific ESM.
  • Work linking Hector and fldgen will need to take that into consideration and quantify or correct for that discrepancy
