
fldgen's Introduction

fldgen 2.0: Climate variable field generator with internal variability and spatial, temporal, and inter-variable correlation.


The fldgen package provides functions to learn the spatial, temporal, and inter-variable correlation of the variability in an earth system model (ESM) and generate random two-variable fields (e.g., temperature and precipitation) with equivalent properties.

Installation

The easiest way to install fldgen is using install_github from the devtools package.

install_github('JGCRI/fldgen', build_vignettes=TRUE)

This will get you the latest stable development version of the model. If you wish to install a specific release, you can do so by specifying the desired release as the ref argument to this function call.
Current and past releases are listed in the release table at our GitHub repository.
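For example, to install a tagged release (the tag name below is illustrative; check the release table for the actual tags):

install_github('JGCRI/fldgen', build_vignettes=TRUE, ref='v2.0.0')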

Example

The data used in this example are installed with the package. They can be found in system.file('extdata', package='fldgen').

library(fldgen)
datadir <- system.file('extdata', package='fldgen')


infileT <- file.path(datadir, 'tas_annual_esm_rcp_r2i1p1_2006-2100.nc')
infileP <- file.path(datadir, 'pr_annual_esm_rcp_r2i1p1_2006-2100.nc')
emulator <- trainTP(c(infileT, infileP),
                    tvarname = "tas", tlatvar='lat', tlonvar='lon',
                    tvarconvert_fcn = NULL,
                    pvarname = "pr", platvar='lat', plonvar='lon',
                    pvarconvert_fcn = log)


residgrids <- generate.TP.resids(emulator, ngen = 3)
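
## tgav is a time series of global mean temperatures used below to build the
## mean field; it is not created by the calls above. One natural choice (an
## assumption on our part -- the emulator appears to store the training
## data's global mean series as emulator$tgav) is:
tgav <- emulator$tgav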

fullgrids <- generate.TP.fullgrids(emulator, residgrids,
                                   tgav = tgav,
                                   tvarunconvert_fcn = NULL,
                                   pvarunconvert_fcn = exp,
                                   reconstruction_function = pscl_apply)

A more detailed example can be found in the tutorial vignette included with the package.

vignette('tutorial2','fldgen')

fldgen's People

Contributors

abigailsnyder, kdorheim, rplzzz

fldgen's Issues

Create and archive trained emulators for popular ESMs

The hardest part of training an emulator is getting hold of the ESM data in the first place. We should train joint temperature-precipitation emulators for the better-known models in the ensemble. We can archive those and provide functions to download them from the archive, saving most users the hassle of wrangling the raw ESM data.
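A hypothetical download helper along those lines (the archive URL, file naming, and serialization format are all placeholders, not an existing API):

fetch_emulator <- function(esm, dest = tempfile(fileext = ".rds")) {
    ## Placeholder archive location; a real version would point at, e.g.,
    ## a Zenodo record.
    url <- paste0("https://example.org/fldgen-emulators/", esm, "_TP.rds")
    download.file(url, dest, mode = "wb")
    readRDS(dest)
}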

Quantile transform sometimes generates NA values

The problem seems to be with the interpolation function used to approximate the quantile function:

inv.ecdf <- function(column.resids) {
    ## For one column of the residuals, get the rank order of the points.
    order.vector <- rank(column.resids)
    ## Cumulative probabilities are (order - 1) / (Ntime + 1);
    ## `offset` (= 1 / (Ntime + 1)) comes from the enclosing scope.
    probabilities <- (order.vector - 1) * offset
    ## Return an interpolator for the inverse ECDF (the quantile function).
    approxfun(x = probabilities, y = column.resids)
}

By default, approxfun produces NA values for inputs outside the range of x-values used to construct the approximation. In theory, this should never happen, but due to roundoff error the largest or smallest value in the empirical distribution can sometimes appear to be out of range. The easiest fix will be to specify rule=2 in the call to approxfun, which will cause these values to be clamped to the range of the x-values used to construct the approximator, which is what we want anyhow.
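A minimal sketch of that fix (the same function with rule = 2 added; `offset` as above):

inv.ecdf <- function(column.resids) {
    order.vector <- rank(column.resids)
    probabilities <- (order.vector - 1) * offset
    ## rule = 2 clamps out-of-range inputs to the nearest endpoint instead
    ## of returning NA, which is the behavior we want here.
    approxfun(x = probabilities, y = column.resids, rule = 2)
}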

Instructions in tutorial2 are incorrect

In tutorial2 we have the following instructions for training a model in a single step

infileT <- file.path(datadir, 'tas_annual_esm_rcp_r2i1p1_2006-2100.nc')
infileP <- file.path(datadir, 'pr_annual_esm_rcp_r2i1p1_2006-2100.nc')
emulator <- trainTP(c(infileT, infileP), 
                    tvarname = "tas", tlatvar='lat', tlonvar='lon',
                    tvarconvert_fcn = NULL,
                    pvarname = "pr", platvar='lat', plonvar='lon',
                    pvarconvert_fcn = log)

However, in these input files, the names of the latitude and longitude coordinates are actually lat_2 and lon_2 (there are lat and lon variables, but they are length 1 and contain just the single value 0 -- I have no idea why it was generated that way).

Because this code isn't run, the package checks didn't catch the error. If you look further down where we do run the code, step by step, we use the correct versions.

Correction: It's only the tas input that has the funky coordinate names; the pr input has regular names.
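For reference, a corrected call under that reading (lat_2/lon_2 for the tas file only) would look like:

emulator <- trainTP(c(infileT, infileP),
                    tvarname = "tas", tlatvar = 'lat_2', tlonvar = 'lon_2',
                    tvarconvert_fcn = NULL,
                    pvarname = "pr", platvar = 'lat', plonvar = 'lon',
                    pvarconvert_fcn = log)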

Add tutorials to actually go through emulator validation

Currently, detailed emulator validation lives in offline R Markdown documents that Abigail just...emails to different people. That's not robust. Incorporate them as additional tutorials in the package:

  • store the netCDF data on Zenodo so that R package utilities can pull it, and an example tutorial emulator can be trained on multiple realizations without including the data in the package.
  • train
  • validate trained EOFs
  • generate realizations
  • validate generated realizations

These tutorials could then be copied and adjusted by users for other emulators. (Attachment: fldgen-analysis.zip)

Update package man page in fldgen.R

We should update the examples section to include an example using the new two-variable functionality.

(This text is duplicated in README.md, so we should update it there too.)

Prove property 1A

Prove that the variance of a grid cell in generated data is equal to the variance of the same cell in the input data.

proofs about means

  1. mu = 0.
  2. mu_i = 0 for all i
  3. (mu_i)^2 = (mu*_i)^2 for all i
  4. mu*_i = 0 [obvious from 2 and 3]
  5. mu* = 0 [obvious from 4]
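
Restating compactly (our reading of the notation, which the issue leaves implicit: unstarred quantities are means over the input data, starred quantities are means over the generated data; $\mu$ is the global mean and $\mu_i$ the mean of grid cell $i$):

$$\mu = 0,\qquad \mu_i = 0 \;\forall i,\qquad \mu_i^2 = (\mu_i^*)^2 \;\forall i \;\Longrightarrow\; \mu_i^* = 0 \;\forall i \;\Longrightarrow\; \mu^* = 0.$$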

Problem with testing fldgen on developmental version of R

We were originally using GitHub Actions to test fldgen across platforms and on two versions of R: the current release (3.6) and the development version. But our dependency on ncdf4 breaks the package on dev R. As of 2020-04-30 (see 00239e0) we've removed this test, but ensuring that fldgen runs on newer versions of R will be an issue to address in the future.

globalop field in griddata is incorrect when grid cells are missing

When part of a grid is missing, as, for example, in the land-only data files provided by ISIMIP, we drop all of the grid cells containing NA values:

fldgen/R/handle_NAs.R, lines 50 to 55 at c8e220d:

## Find the grid cells with valid data
valid_cells <- which(nmiss == 0)
## Drop missing cells from vardata and globalop
griddata$vardata <- griddata$vardata[ , valid_cells]
griddata$globalop <- griddata$globalop[valid_cells, ]

The globalop field is the global mean operator; we have to drop the missing values from there too. However, for the global mean operator to do its job, it has to be scaled so that it sums up to 1, and we don't rescale it here. This causes our global mean temperatures to be bogus.

> emu <- train_models('IPSL-CM5A-LR')
> sum(emu$griddataT$globalop)
[1] 0.2869764
> summary(emu$tgav)
       V1       
 Min.   :81.73  
 1st Qu.:82.01  
 Median :82.16  
 Mean   :82.39  
 3rd Qu.:82.73  
 Max.   :84.28  
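
The fix stated above, as a minimal sketch (renormalizing immediately after the missing cells are dropped):

## Drop missing cells from vardata and globalop
griddata$vardata <- griddata$vardata[ , valid_cells]
griddata$globalop <- griddata$globalop[valid_cells, ]
## Rescale the global mean operator so its weights again sum to 1.
griddata$globalop <- griddata$globalop / sum(griddata$globalop)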

update stickers on the readme

No rush. Right now the README includes badges for Travis and AppVeyor; we should update it at some point to reflect that we are now using GitHub Actions.

Misleading documentation in generate.TP.fullgrids

#' This function takes in a trained emulator - a structure of class
#' \code{fldgen}.
#' This structure contains everything the emulator has learned about the model,
#' and is used to generate new fields of residuals. Also taking in a list of
#' generated residual fields and a global average yield, a global gridded
#' mean field is constructed according to the input reconstruction_function.
#' The mean field and residual fields are added to return a list of different
#' realizations of full fields.

This passage suggests that in this function we will be generating new residuals, but in fact the residuals have to be generated outside of the function and passed in. Also, the text could be clarified as to how the probability distribution transform is applied, and we should drop the reference to "yield", which is confusing.

Clean up figures

Get figures publication-ready. Clean up things like legends, fonts, line weights, etc.

Handle multiple input time series

Update the code to handle multiple input time series. For the mean field model and the EOF decomposition, this is trivial; both of those algorithms operate on the rows of the input data independently, so you can just concatenate the data sets and pass them to the relevant functions.

For the power spectrum analysis, you have to do each Fourier transform individually and average the magnitudes of the power spectra found from each time series. That introduces some minor complications because you need to keep a record of which rows in the grand data matrix correspond to which input time series. It’s important to do the bookkeeping in a way that doesn’t result in a lot of data copying.
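
A sketch of the averaging step, assuming each input time series has already been reduced to an ntime x ngrid residual matrix of equal length (the row-bookkeeping machinery is omitted; the function name is ours):

avg_power_spectrum <- function(serieslist) {
    speclist <- lapply(serieslist, function(resids) {
        ## Fourier transform each grid cell's time series individually and
        ## take the squared magnitude as its power spectrum.
        apply(resids, 2, function(x) Mod(fft(x))^2)
    })
    ## Average the spectra over the input time series.
    Reduce(`+`, speclist) / length(speclist)
}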

'reading and writing models works' test fails

The 'reading and writing models works' test in test_writedata.R fails. It contains the following comments:

## These tests are kind of long, since the fldgen objects are rather large.
## Therefore, skip them on the CI environments, which are somewhat
## time-sensitive.  And since we're skipping them on the CI, I also haven't
## included the data file in the repository (it is quite large for the time
## being).  We can revisit this once we find a way to shrink the memory
## footprint of the fldgen objects.

fully define and use classes for composite structures

e.g. #14

rplzzz (Member):
This is a case where we could really benefit from defining classes for all of our composite structures. It would make describing what the functions are expecting easier, and we could even check the class attributes in functions that are expecting structures.

fix the memory bloat of a saved emulator

Option 1:
Use something other than approxfun to characterize the empirical CDF for each grid cell using fewer stored values, in R/normalizeresiduals.R (#14).

Option 2:
If Option 1 doesn't bring down the size of a saved emulator by enough, cry and think of something else

revisit warnings - variable labels - units

  1. We add a lot of warning messages that only the package developers really understand, especially around converting units from P to logP and back again while leaving T alone (e.g., it prints a warning message that T didn't change).

A lot of this was because we wrote the code so that it didn't matter whether variable 1 was temperature, or precip, or something else.

If we make firmer assumptions about what each variable is, we can create variable specific warning messages/print statements that are more useful.

  2. Maybe add more explicit unit checks to the read functions (e.g., check whether T units are in K or C); see the sketch below.
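
A hypothetical check of the kind item 2 describes (the accepted unit strings follow the CF-style spellings usually seen in these netCDF files; the function name is ours):

check_temp_units <- function(units) {
    ## Accept Kelvin or Celsius spellings commonly seen in CMIP-style files.
    if (!units %in% c("K", "degC", "C")) {
        warning("unrecognized temperature units: ", units)
    }
    invisible(units)
}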

Test installing from github

Make sure that the fldgen v2 release candidate installs cleanly with install_github. I always install from a local repository, so I don't think I've ever tried it.

Add CITATION file

Add a CITATION file to provide instructions for citing the package. There's an example in the food demand package: https://github.com/JGCRI/food-demand/blob/rpkg/inst/CITATION

One tricky thing here is that we would like to have version 2 reference both the original v1 paper and the v2 software paper. However, we will need to have a finalized v2 release to submit the paper. I think we can tiptoe around this issue by designating the release that we make when the paper goes out as a pre-release (and calling it something like 2.0.0-beta) and then finalizing it after we have some bibliographic information. I'll look into what options we have.

planned potential changes to get to v2.1

Sometime this summer depending on staffing.

  1. Make the 1var training function capable of handling training data with NA values. Create functions for 1var residual generating and full field generating. This would just be taking all the 2var stuff I’ve written and adapting it.
  2. Make a tutorial0 focused on the one step training and generating functions: one variable, two variable, data with NA values, and using a different tgav than that calculated from the data.
  3. Train TP emulators for a handful of major ESMs and put them on Zenodo. Then we can include the link to that in the github repo.
  4. Make the names of the functions in fldgen more generic: train2var instead of trainTP, using things like var1, var2 instead of T and P. The actual science is all the same - our 2 var functions should work for any 2 related variables, not just TP. So we could make the names reflect that.

remove dependency on gcammaptools

This consistently causes issues in installations and testing. As far as I can tell, we only use a call to gcammaptools in the plot_field function in fldgen/R/plot.R and then in calls to plot_field from the vignettes. This is a handy convenience function that makes pretty maps, but in no way core to the science we're trying to do.

  • @kdorheim track down and comment out calls to plot_field, and comment out most of the interior of plot_field as well. Hopefully there isn't too much to track down; that then lets us move on to the next issue with setting up testing.

  • @abigailsnyder implement a different interior for plot_field that doesn't require a call to gcammaptools (see the sketch below). This will probably sacrifice having a nice map projection, in the short term at least. Just do quick rectangular map plotting so that we have a working convenience function we can call from the vignettes. We can improve the aesthetics of its maps down the road, or not (a user could obviously make their own map if they really hated ours).
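
A minimal sketch of a gcammaptools-free interior, assuming ggplot2 and that `field` is an nlat x nlon matrix with `lat` and `lon` coordinate vectors (the real plot_field signature may differ):

library(ggplot2)

plot_field_simple <- function(field, lat, lon) {
    ## Quick rectangular lat-lon plot; no map projection.
    df <- expand.grid(lat = lat, lon = lon)
    ## Column-major flattening of an nlat x nlon matrix makes lat vary
    ## fastest, matching expand.grid's ordering above.
    df$value <- as.vector(field)
    ggplot(df, aes(x = lon, y = lat, fill = value)) +
        geom_raster() +
        coord_fixed() +
        labs(fill = "value")
}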

Fix failing tests

Although everything appears to be working properly in practical use, a few of the tests are still failing. Circle back and fix these.

Prove property 1B

Prove that the values in a grid cell of the generated data are (approximately) normally distributed (across time and between realizations).

I believe we can appeal to the CLT for this one. Therefore, the property is only approximately true, but our empirical results seem to indicate that it is more than close enough for our purposes.

Prove property 3

Prove that the spatial correlation in the generated fields is the same as the spatial correlation function for the input data.

We have a little flexibility in how we go about this. Actually computing the spherical ACF is a pain, but we might be able to finesse that because we don't actually need to compute the ACF to prove that the ACFs for the two fields are equal. Alternatively, it might be sufficient to show that the covariance matrices for the two fields are the same.

Add option to splice a historical scenario onto each future scenario

In the past, we've run only future scenarios, but for an upcoming experiment we need to be able to run fldgen over datasets that span the entire historical and future periods. The easiest way to do this is to add an (optional) argument to specify a historical scenario to splice onto the beginning of each future scenario.

Once we've done that, we can treat each spliced scenario almost exactly as if it were provided as a single scenario. One exception is that when doing the linear fits, we might want to arrange for duplicated historical scenarios to be included in the fit only once, so as not to bias the fit. This will require a bit of extra record keeping in the tagout structure, but otherwise should be pretty easy to do.
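
A sketch of the splice itself, assuming scenario data arrive as matrices with one row per year (the helper name is ours):

splice_hist <- function(histdata, futuredata) {
    ## The spliced result is then treated as one scenario spanning both
    ## periods; deduplicating repeated historical years in the linear
    ## fits would be handled separately.
    rbind(histdata, futuredata)
}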

downsized data sets to speed up testing

Currently devtools::test() takes a long time because our included sample data is large. At some point, we could figure out smaller sample data that works for the tests just as well.

Scale data when computing EOFs, or not?

The code currently does not scale the residual data to unit variance, nor, for that matter, does it center it. The reason we chose not to scale is that it slightly complicates the procedure for reconstructing fields from the EOFs. However, conventional wisdom is that scaling is advisable.
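
For concreteness, the two options expressed with a stock PCA call (prcomp here is a stand-in for the package's own EOF code; `resids` is a hypothetical ntime x ngrid residual matrix):

## Current behavior: no centering, no scaling.
eof_unscaled <- prcomp(resids, center = FALSE, scale. = FALSE)
## Conventional alternative: center and scale each grid cell to unit variance.
eof_scaled   <- prcomp(resids, center = TRUE, scale. = TRUE)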

Questions:

  1. Does scaling add value to the process sufficient to justify the extra complexity?
  2. What, if any, harm can not scaling introduce?

Prove property 2

Prove that within a single run the autocorrelation function for a grid cell is equal to that of the corresponding cell in the reference data.

This is guaranteed for the basis functions by the Wiener-Khinchin theorem. We need to show that applying the W-K theorem to each basis function independently produces the appropriate ACF for each grid cell individually.

variability in Tgav and using Hector to drive trained ESM emulators

Documenting points here:

  • fldgen successfully deals with variability about the ESM_Tloc-ESM_Tgav regression.
  • ESM Tgav trajectories have interannual and decadal variability. However, Hector's Tgav trajectories are significantly smoother, even when calibrated to a specific ESM.
  • Work linking Hector and fldgen will need to take that into consideration and quantify or correct for that discrepancy
