
gard's Introduction


Generalized Analog Regression Downscaling (GARD)

This code implements a hybrid analog/regression multivariate downscaling procedure. The program reads a namelist file for its configuration. Downscaling is performed on a grid-cell by grid-cell basis and permits multiple approaches.

The standard hybrid analog-regression approach uses the input predictor variables to select a group of analog days (e.g. 300) from the training period for each day to be predicted. These analog days are then used to compute a multi-variable regression between the training data (e.g. wind, humidity, and stability) and the variable to be predicted (e.g. precipitation). The regression coefficients are applied to the predictor variables to compute the expected downscaled value, and to the training data to compute the error in the regression. Optionally, a logistic regression can be used to compute, e.g., the probability of precipitation occurrence on a given day, or the probability of exceeding any other threshold; the logistic regression coefficients are then applied to the predictors and output, or the analog exceedance probabilities can be output instead.

Alternatively, the code can compute the regressions over the entire time series supplied (a pure regression approach), or the analogs themselves can be used as the result (a pure analog approach). The pure analog approach can compute the mean of the selected analogs, randomly sample the analogs, or compute a weighted mean based on each analog's distance from the current predictors.
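As an illustration, the hybrid analog-regression step described above can be sketched in Python with NumPy. GARD itself is Fortran; the function and variable names here are hypothetical, not GARD's actual interface.

```python
import numpy as np

def analog_regression(train_pred, train_obs, current_pred, n_analogs=300):
    """Select the n_analogs training days closest to current_pred in
    predictor space, then fit a least-squares regression on those days only."""
    # Euclidean distance between the current day and every training day
    dist = np.sqrt(((train_pred - current_pred) ** 2).sum(axis=1))
    analogs = np.argsort(dist)[:n_analogs]

    # Multi-variable regression (with intercept) over the analog days
    X = np.column_stack([np.ones(n_analogs), train_pred[analogs]])
    coef, *_ = np.linalg.lstsq(X, train_obs[analogs], rcond=None)

    # Apply coefficients to the current predictors for the downscaled value;
    # residuals over the analogs estimate the regression error
    prediction = coef[0] + current_pred @ coef[1:]
    error = np.std(X @ coef - train_obs[analogs])
    return prediction, error
```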

The code requires both training and predictor data for the same variables as well as a variable to be predicted. The training and prediction data can include as many variables as desired (e.g. wind, humidity, precipitation, CAPE). All data must have geographic and time coordinate variables associated with them.

While this was developed for downscaling climate data, it is general purpose and could be applied to a wide variety of problems in which both analogs and regression make sense.

Documentation is being built on the GARD Readthedocs page.


gard's Issues

ability to compute min, max, or range for a variable

For forecasting, when multiple time intervals are combined to make up one day, it would be a useful option to compute, e.g., the minimum temperature, maximum temperature, or diurnal temperature range over that interval instead of only the mean.
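A minimal sketch of such a configurable aggregation, in Python (the `aggregate_day` name and `method` options are illustrative, not GARD's actual interface):

```python
import numpy as np

def aggregate_day(values, method="mean"):
    """Collapse the sub-daily time steps making up one day into one value."""
    ops = {"mean": np.mean,
           "min": np.min,
           "max": np.max,
           "range": lambda v: np.max(v) - np.min(v)}  # diurnal range
    return ops[method](np.asarray(values))
```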

Detrend data prior to downscaling

It is common in downscaling to remove a given trend (e.g. temperature) before applying any statistical downscaling, then reapply the trend after downscaling. Although there are issues surrounding this methodology, it is widely used, and it would be nice to provide an option for it. It is a little tricky to think about what this might mean logistically. Does GARD need to be able to read in an entire time series (~150 yrs) in order to do this? It can be done on the coarse grid, so it may be feasible to fit all of that data in memory.
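The detrend/reapply workflow could look roughly like this Python sketch (illustrative only; GARD itself is Fortran, and `detrend` is a hypothetical name):

```python
import numpy as np

def detrend(t, series):
    """Remove a linear trend from a series; also return a function that
    evaluates the trend so it can be re-applied after downscaling."""
    slope, intercept = np.polyfit(t, series, 1)
    trend = lambda tt: slope * np.asarray(tt) + intercept
    return series - trend(t), trend
```

After downscaling the detrended series, the returned `trend` function can be evaluated at the output times and added back on.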

Improve Quantile Mapping

The quantile mapping routine should be configurable so that the number of quantiles is read from the config/namelist file.

Also, the mapping itself should be improved by computing the mean over each CDF quantile instead of simply sampling the CDF.
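A sketch of quantile mapping that uses the mean over each CDF quantile bin rather than a single sampled CDF point (Python/NumPy; the function names are hypothetical, not GARD's routines):

```python
import numpy as np

def build_qm_table(source, target, n_quantiles=100):
    """Build a quantile-mapping lookup where each entry is the MEAN of the
    data falling in that quantile bin, not a single sampled CDF point."""
    edges = np.linspace(0, 100, n_quantiles + 1)

    def bin_means(data, bin_edges):
        # Assign each value to its quantile bin, then average within bins
        idx = np.clip(np.searchsorted(bin_edges, data, side="right") - 1,
                      0, n_quantiles - 1)
        return np.array([data[idx == i].mean() for i in range(n_quantiles)])

    src_means = bin_means(source, np.percentile(source, edges))
    tgt_means = bin_means(target, np.percentile(target, edges))
    return src_means, tgt_means

def apply_qm(values, src_means, tgt_means):
    """Map values through the source->target table by interpolation."""
    return np.interp(values, src_means, tgt_means)
```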

Improve Logging

Some of GARD's logging/error statements are hidden behind the debug option. I just ran into a case where GARD exited silently due to a configuration error (misaligned training/obs dates). These sorts of messages should be printed regardless of the debug option setting.

It would actually be nice to have a bit more control over the logging of GARD runs, with the ability to print at the Error/Warning/Info/Debug levels.
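A minimal sketch of such leveled logging (Python; the `LogLevel`/`Logger` names are illustrative, not GARD's API):

```python
from enum import IntEnum

class LogLevel(IntEnum):
    ERROR = 0
    WARNING = 1
    INFO = 2
    DEBUG = 3

class Logger:
    """Emit only messages at or below the configured verbosity, so that
    configuration errors surface even when debug output is turned off."""
    def __init__(self, level=LogLevel.WARNING):
        self.level = level
        self.lines = []

    def log(self, level, msg):
        if level <= self.level:
            self.lines.append(f"[{level.name}] {msg}")
```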

read / write coefficients

Add the ability to write out coefficients (i.e. from pure_regression) to an output file, and then read those same coefficients in to a subsequent run.

For pure regression runs this should drastically speed up the run time, though it won't work for analog-regression runs unless some standard set of weather types are first computed and used to compute a set of regression coefficients for each group.

Should GARD clobber existing output files?

Would it make sense to have GARD clobber existing output files? I recently ran a GARD simulation that took 4.5 hours, only to have it error when it went to write the output...

 ---------------------------------------------------
            Time profiling information
 ---------------------------------------------------
 Total Time :       593774  s  (CPU time)
 Total Time :        16489  s (wall clock)
 ---------------------------------------------------
 Allocation :            0 %
 Data Init  :            0 %
 GeoInterp  :            0 %
 Transform  :            0 %
 Analog     :           20 %
 Regression :            3 %
 Log.Regres :           77 %
 Log.Analog :            0 %
 ---------------------------------------------------
 Parallelization overhead
 Residual   :            0 %
 ---------------------------------------------------
 ==========================================

 Writing output
 NetCDF: String match to name in use
 /gpfs/flash/jhamman/GARD_downscaling_20190423/NCAR_WRF_50km/analog_regression_3/19510101-19821231/gard_output.analog_regression_3.NCAR_WRF_50km.noresm.hist.19510101-19821231.pcp.nc:pcp

Perhaps there are actually two options here: 1) GARD should error early if an output file already exists, or 2) GARD should clobber the existing output file.
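Option 1 (error early) could be as simple as this Python sketch (illustrative only; GARD's actual output code is Fortran/netCDF, and `open_output` is a hypothetical name):

```python
import os

def open_output(path, clobber=False):
    """Fail before any computation starts if the output file already
    exists, unless the user explicitly asked to overwrite it."""
    if os.path.exists(path) and not clobber:
        raise FileExistsError(
            f"{path} already exists; set clobber=True to overwrite it")
    return open(path, "w")
```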

@gutmann - thoughts on how this should be handled?

Incorporate EOF analysis

Downscaling often uses a PC time series of EOFs computed over a larger spatial domain. It would be nice to let GARD pre-process a given variable (e.g. sea-level pressure) to compute EOFs and corresponding eigenvector weights.
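EOF analysis of a (time, space) anomaly field reduces to an SVD; a Python/NumPy sketch with hypothetical names:

```python
import numpy as np

def eof_analysis(field, n_modes=3):
    """Compute EOFs (spatial patterns) and PC time series of a
    (time, space) field via the SVD of its anomalies."""
    anomalies = field - field.mean(axis=0)
    u, s, vt = np.linalg.svd(anomalies, full_matrices=False)
    pcs = u[:, :n_modes] * s[:n_modes]   # PC time series
    eofs = vt[:n_modes]                  # spatial patterns
    variance_frac = s[:n_modes] ** 2 / (s ** 2).sum()
    return pcs, eofs, variance_frac
```

The leading PC time series could then be passed to GARD as additional predictor variables.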

Add a seasonal locality option

Many downscaling methods operate independently for each month of the year to help preserve the seasonal cycle.

This can be done in GARD by downscaling separately for each month, but that requires substantial pre-processing of the input data (separating monthly data for independent GARD runs and stitching the results back together at the end).

It is relatively simple to add a “day of the year” variable to the input files, and use that (in any of the analog schemes anyway). To be able to wrap around the year it could be two variables (equivalent to “u” and “v” for wind direction with a unit length). That way days that are far away (in day of year) would have to be very similar for all other variables, while days that are close in day of year would be given more flexibility. It is a little awkward, but might help… It probably will not help the pure regression mode though.

Alternatively it would be nice to add a Day-Of-Year localization option in GARD.

For the pure regression mode this would be tricky: it would require GARD to call downscale_point ~12 (or however many) times, subsetting the data appropriately each time, then stitch the results back together internally.

For the analog modes, it should be possible for GARD to simply have an option (in select_analogs) to take time data and enforce that the DoY of any selected analog be within X days. This part would be "easy".
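The wrap-around day-of-year encoding described above (the "u"/"v" trick with unit length) can be sketched as follows; the function names are illustrative:

```python
import math

def doy_uv(day_of_year, days_in_year=365):
    """Encode day-of-year as a point on the unit circle (the "u"/"v"
    components) so distances wrap correctly across the new year."""
    angle = 2 * math.pi * day_of_year / days_in_year
    return math.cos(angle), math.sin(angle)

def doy_distance(d1, d2, days_in_year=365):
    """Chord distance between two days; Dec 31 and Jan 1 come out close."""
    u1, v1 = doy_uv(d1, days_in_year)
    u2, v2 = doy_uv(d2, days_in_year)
    return math.hypot(u1 - u2, v1 - v2)
```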

Break up processing into chunks

To permit larger runs in limited memory it would be nice to be able to specify a chunk size in time (or space) over which GARD would process datasets.

Performing this split in time might be the most obvious (and goes along with splitting up the output files #11 )

Performing the split in space could be more efficient because GARD needs a complete time series for each grid point in the training period regardless of how much time is being processed. Would this end up writing separate files for each subset that need to be recombined later or could it write just the subset it processed to an output file that contains the entire grid?

Ideally both should be options, but do they need to be?

Tradeoffs in implementation complexity and benefit should be assessed briefly.
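The chunk bookkeeping itself is trivial; a Python sketch of the time split (the function name is hypothetical):

```python
def time_chunks(n_times, chunk_size):
    """Yield (start, stop) index pairs covering the full record in
    fixed-size pieces; the last chunk may be shorter."""
    for start in range(0, n_times, chunk_size):
        yield start, min(start + chunk_size, n_times)
```

The harder questions above (memory footprint, recombining outputs) are about what happens inside each chunk, not the iteration itself.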

strip timers from code

Now that basic development is "done", all of the timers that compute how long GARD spent in each part of the code can be stripped out to clean up some sections.

Split output files

Add an option to create monthly, yearly, decadal, etc. output files instead of just a single massive file.

output netcdf debug files when debug set to False

I am running GARD from the SHARP branch with the following config file settings. Even though debug is set to False, several netcdf files that are related to the debug option are still created: obs_preload_pcp.nc, training_preload_APCP_surface.nc, prediction_preload_APCP_surface.nc, obs_preload_t_mean.nc, training_preload_TMP_2maboveground.nc, prediction_preload_TMP_2maboveground.nc.

This seems to only happen when using the pure_regression option, and not the pass_through option. I haven't been able to find a place where the debug option is reset outside of the read_config() module.

&parameters
    output_file     = "/d2/hydrofcst/overtheloop/data/met_fcst/gefs/grid_nc/pure_regression/PNW/20170117/gard_output.pure_regression.20170117.p01.0dy.PNW."

    n_analogs        = 200

    start_date       = "2017-01-16 00:00:00"
    end_date         = "2017-01-16 00:00:00"

    start_train      = "2006-01-01 00:00:00"
    end_train        = "2009-12-31 00:00:00"

    start_transform  = "2006-01-01 00:00:00"
    end_transform    = "2009-12-31 00:00:00"

    pure_regression   = True
    pure_analog       = False
    analog_regression = False
    pass_through = False

    write_coefficients = False
    read_coefficients = True
    coefficients_files = /d2/hydrofcst/overtheloop/data/met_fcst/gefs/grid_nc/pure_regression/PNW/20170117/gard_output.pure_regression.20170113.p01.0dy.PNW.t_mean_coef.nc

    sample_analog = False
    logistic_from_analog_exceedance = False
    logistic_threshold = -9999

    debug = False
    interactive = False
/

&training_parameters
    name = "GEFS training data"
    interpolation_method = 2
    time_indices = 5,6,7,8,9,10,11,12
    agg_method = 0
    time_weights = 3,3,3,3,3,3,3,3
    nvars     = 1
    data_type = "GEFS"
    lat_name  = "latitude"
    lon_name  = "longitude"
    time_name = "time"
    nfiles    = 1461

    input_transformations = 0

    var_names = TMP_2maboveground
    file_list = "/d2/hydrofcst/overtheloop/data/met_fcst/gefs/grid_nc/gard_filelists/gard_filelist.gefs.TMP_2maboveground.mean.20060101-20091231.gefs.txt"

    selected_time = -1
    calendar  = "gregorian"
    timezone_offset = -8
/

&prediction_parameters
    name = "GEFS prediction data"
    interpolation_method = 2
    normalization_method = 2
    time_indices = 5,6,7,8,9,10,11,12
    time_weights = 3,3,3,3,3,3,3,3
    agg_method = 0
    nvars     = 1
    data_type = "GEFS"
    lat_name  = "latitude"
    lon_name  = "longitude"
    time_name = "time"
    nfiles    = 1

    input_transformations = 0

    var_names = TMP_2maboveground
    file_list = "/d2/hydrofcst/overtheloop/data/met_fcst/gefs/grid_nc/gard_filelists/gard_filelist.gefs.TMP_2maboveground.p01.20170117.gefs.txt"
    calendar  = "gregorian"
    timezone_offset = -8
/


&obs_parameters
    name = "SHARP Newman Ensemble"
    nvars     = 1
    nfiles    = 1
    data_type = "obs"
    lat_name  = "latitude"
    lon_name  = "longitude"
    time_name = "time"

    input_transformations = 0
    var_names = t_mean
    file_list = "/d2/hydrofcst/overtheloop/data/met_fcst/gefs/grid_nc/gard_filelists/gard_filelist.obs.t_mean.mean.20091231.PNW.txt"

    calendar  = "gregorian"
/

Pure Analog: when to use days without precipitation

@gutmann and I just had a conversation about the use of (non-)precipitation days in the pure analog case in GARD.

Some things to look into:

  • Normalization - should non-precip days be used in mean/std?
  • analog selection - should non-precip days be used in analog selections? With/without sample_analog?

GH Pages Documentation

Create a gh-pages branch that shows the Doxygen (or other) generated documentation.

Many (most?) routines already have doxygen-style documentation comments, and the doxygen config file exists. This simply needs a branch set up in a subdirectory and pushed.

Do we need to check for recombination, if we have a species tree?

Since recombination affects phylogenetic tree inference from a multiple sequence alignment, if we have a species tree available (independent of the MSA), do we still need to check for recombination in the sequences using GARD to avoid its effect on selection analysis?

minval normalization will be a problem

At the moment, the normalization step subtracts off the minimum value to avoid any negative values. This is fine for precip, for which the minimum is always 0, but for temperature it shifts the location of the mean based on a single outlier minimum temperature.

This is fine for the forecasting case, in which training data are used to normalize the predictors, so there is no discrepancy; but for climate model downscaling this could be very bad.

add interactive namelist option

When debug=False, GARD outputs a progress/status percentage that makes output piped to a file quite difficult to read. @gutmann suggested an interactive namelist option that would allow users to limit the output when logging to a file.

Better checking of time data

We should have a series of checks to verify that the dates/times specified in the GARD namelist will work. These checks should basically cover these cases:

  • training period is available in both the training data and observations data
  • prediction period is available in the prediction data
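Such checks could be sketched as follows (Python; `check_period` is a hypothetical name, not a GARD routine):

```python
from datetime import datetime

def check_period(start, end, times, label):
    """Raise a clear error (instead of exiting silently) when a requested
    period is not covered by a dataset's time axis."""
    if start < min(times) or end > max(times):
        raise ValueError(
            f"{label} period {start}..{end} is outside the "
            f"data range {min(times)}..{max(times)}")
```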

GEFS Select Model Level

The GEFS IO code should use the options%selected_level field when deciding which atmospheric model level to read data from.

This is already implemented in the configuration, and is used in the gcm IO code, it just needs to be implemented in the GEFS code.

Test case

We should develop a (simple) standard test case or cases. Data should be hosted elsewhere, and could even just be located on yellowstone and/or hydro-c1 for now.

Fix normalization for short periods

When the prediction period is only 1 (or fewer than ~100) time steps, the normalization step will not work well for the predictor data.

If possible, the predictor data should be normalized instead using the mean and standard deviation from the training data (assuming it is the "same" and that it has a longer time period)
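A sketch of normalizing a short prediction period with training-period statistics (Python/NumPy, illustrative names):

```python
import numpy as np

def normalize_with_training(predictors, training):
    """Normalize a (possibly very short) prediction-period series using
    the mean and standard deviation of the longer training period."""
    mu = training.mean(axis=0)
    sigma = training.std(axis=0)
    return (predictors - mu) / sigma
```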

Read and apply SCRFs internally

Provide an option to GARD to read in pre-computed SCRF fields so they can be used with, e.g., sample_analog or simply used to apply the error and pop terms internally.

This could go along with writing only the final output, instead of writing (and having to save) the mean, error, and pop fields. That could save a lot of memory and enable longer runs without having to break them up.

Improve output files

Currently, output files have no attributes. At a minimum, output files should have well named variables with attributes. Better, we should probably combine output variables into a single file. This should include lat / lon / time variables.

Get fill value from obs files

The current behavior of GARD is to use a namelist option to specify the fill value in the input observations netCDF data. Instead, we should just check for the _FillValue attribute or a mask variable to determine the spatial domain for GARD.
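Deriving the domain mask from the data itself could look like this NumPy sketch (here the `fill_value` argument stands in for the `_FillValue` attribute read from the file; `valid_domain` is a hypothetical name):

```python
import numpy as np

def valid_domain(obs, fill_value=None):
    """Mark a grid cell active if it ever holds a finite, non-fill value.
    fill_value would normally come from the variable's _FillValue
    attribute in the observations netCDF file."""
    invalid = ~np.isfinite(obs)
    if fill_value is not None:
        invalid |= (obs == fill_value)
    return ~invalid.all(axis=0)  # True where at least one valid time step
```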
