
gard's Introduction


Generalized Analog Regression Downscaling (GARD)

This code implements a hybrid analog/regression multivariate downscaling procedure. The program reads a namelist file for its configuration. Downscaling is performed on a grid-cell by grid-cell basis and permits multiple approaches.

The standard hybrid analog-regression approach uses the input predictor variables to select a group of analog days (e.g. 300) from the training period for each day to be predicted. These analog days are then used to compute a multi-variable regression between the training data (e.g. wind, humidity, and stability) and the variable to be predicted (e.g. precipitation). The regression coefficients are applied to the predictor variables to compute the expected downscaled value, and to the training data to compute the error in the regression. Optionally, a logistic regression can be used to compute, e.g., the probability of precipitation occurrence on a given day, or the probability of exceeding any other threshold; the logistic regression coefficients are then applied to the predictors and output, or the analog exceedance probabilities can be output instead.

Alternatively, the code can compute the regressions over the entire time series supplied (a pure regression approach), or the analogs themselves can be used as the result (a pure analog approach). The pure analog approach can compute the mean of the selected analogs, randomly sample the analogs, or compute a weighted mean based on each analog's distance from the current predictors.
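As an illustration, the hybrid analog-regression step described above can be sketched in Python with NumPy. GARD itself is Fortran; the function and variable names here are hypothetical, not GARD's actual interface.

```python
import numpy as np

def analog_regression(train_pred, train_obs, current_pred, n_analogs=300):
    """Select the n_analogs training days closest to current_pred in
    predictor space, then fit a least-squares regression on those days only."""
    # Euclidean distance between the current day and every training day
    dist = np.sqrt(((train_pred - current_pred) ** 2).sum(axis=1))
    analogs = np.argsort(dist)[:n_analogs]

    # Multi-variable regression (with intercept) over the analog days
    X = np.column_stack([np.ones(n_analogs), train_pred[analogs]])
    coef, *_ = np.linalg.lstsq(X, train_obs[analogs], rcond=None)

    # Apply coefficients to the current predictors for the downscaled value;
    # residuals over the analogs estimate the regression error
    prediction = coef[0] + current_pred @ coef[1:]
    error = np.std(X @ coef - train_obs[analogs])
    return prediction, error
```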

The code requires both training and predictor data for the same variables as well as a variable to be predicted. The training and prediction data can include as many variables as desired (e.g. wind, humidity, precipitation, CAPE). All data must have geographic and time coordinate variables associated with them.

While this was developed for downscaling climate data, it is general purpose and could be applied to a wide variety of problems in which both analogs and regression make sense.

Documentation is being built on the GARD Readthedocs page.


gard's Issues

ability to compute min, max, or range for a variable

For forecasting, when multiple time intervals are combined to make up one day, it would be a useful option to compute, e.g., the minimum temperature, maximum temperature, or diurnal temperature range over that interval instead of only the mean.
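A minimal sketch of such a configurable aggregation, in Python (the `aggregate_day` name and `method` options are illustrative, not GARD's actual interface):

```python
import numpy as np

def aggregate_day(values, method="mean"):
    """Collapse the sub-daily time steps making up one day into one value."""
    ops = {"mean": np.mean,
           "min": np.min,
           "max": np.max,
           "range": lambda v: np.max(v) - np.min(v)}  # diurnal range
    return ops[method](np.asarray(values))
```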

Detrend data prior to downscaling

It is common in downscaling to remove a given trend (e.g. temperature) before applying any statistical downscaling, then reapply the trend after downscaling. Although there are issues surrounding this methodology, it is widely used, and it would be nice to provide an option for it. It is a little tricky to think about what this might mean logistically. Does GARD need to be able to read in an entire time series (~150 yrs) in order to do this? It can be done on the coarse grid, so it may be feasible to fit all of that data in memory.
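The detrend/reapply workflow could look roughly like this Python sketch (illustrative only; GARD itself is Fortran, and `detrend` is a hypothetical name):

```python
import numpy as np

def detrend(t, series):
    """Remove a linear trend from a series; also return a function that
    evaluates the trend so it can be re-applied after downscaling."""
    slope, intercept = np.polyfit(t, series, 1)
    trend = lambda tt: slope * np.asarray(tt) + intercept
    return series - trend(t), trend
```

After downscaling the detrended series, the returned `trend` function can be evaluated at the output times and added back on.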

Improve Quantile Mapping

The quantile mapping routine should be configurable so that the number of quantiles is read from the config/namelist file.

Also, the mapping itself should be improved by computing the mean over each CDF quantile instead of simply sampling the CDF.
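A sketch of quantile mapping that uses the mean over each CDF quantile bin rather than a single sampled CDF point (Python/NumPy; the function names are hypothetical, not GARD's routines):

```python
import numpy as np

def build_qm_table(source, target, n_quantiles=100):
    """Build a quantile-mapping lookup where each entry is the MEAN of the
    data falling in that quantile bin, not a single sampled CDF point."""
    edges = np.linspace(0, 100, n_quantiles + 1)

    def bin_means(data, bin_edges):
        # Assign each value to its quantile bin, then average within bins
        idx = np.clip(np.searchsorted(bin_edges, data, side="right") - 1,
                      0, n_quantiles - 1)
        return np.array([data[idx == i].mean() for i in range(n_quantiles)])

    src_means = bin_means(source, np.percentile(source, edges))
    tgt_means = bin_means(target, np.percentile(target, edges))
    return src_means, tgt_means

def apply_qm(values, src_means, tgt_means):
    """Map values through the source->target table by interpolation."""
    return np.interp(values, src_means, tgt_means)
```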

Improve Logging

Some of GARD's logging/error statements are hidden behind the debug option. I just ran into a case where GARD exited silently due to a configuration error (misaligned training/obs dates). These sorts of messages should be printed regardless of the debug option setting.

It would actually be nice to have a bit more control over the logging of GARD runs, with the ability to print at the Error/Warning/Info/Debug levels.
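A minimal sketch of such leveled logging (Python; the `LogLevel`/`Logger` names are illustrative, not GARD's API):

```python
from enum import IntEnum

class LogLevel(IntEnum):
    ERROR = 0
    WARNING = 1
    INFO = 2
    DEBUG = 3

class Logger:
    """Emit only messages at or below the configured verbosity, so that
    configuration errors surface even when debug output is turned off."""
    def __init__(self, level=LogLevel.WARNING):
        self.level = level
        self.lines = []

    def log(self, level, msg):
        if level <= self.level:
            self.lines.append(f"[{level.name}] {msg}")
```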

read / write coefficients

Add the ability to write out coefficients (i.e. from pure_regression) to an output file, and then read those same coefficients in to a subsequent run.

For pure regression runs this should drastically speed up the run time, though it won't work for analog-regression runs unless some standard set of weather types are first computed and used to compute a set of regression coefficients for each group.

Should GARD clobber existing output files?

Would it make sense to have GARD clobber existing output files? I recently ran a GARD simulation that took 4.5 hours, only to have it error when it went to write the output...

 ---------------------------------------------------
            Time profiling information
 ---------------------------------------------------
 Total Time :       593774  s  (CPU time)
 Total Time :        16489  s (wall clock)
 ---------------------------------------------------
 Allocation :            0 %
 Data Init  :            0 %
 GeoInterp  :            0 %
 Transform  :            0 %
 Analog     :           20 %
 Regression :            3 %
 Log.Regres :           77 %
 Log.Analog :            0 %
 ---------------------------------------------------
 Parallelization overhead
 Residual   :            0 %
 ---------------------------------------------------
 ==========================================

 Writing output
 NetCDF: String match to name in use
 /gpfs/flash/jhamman/GARD_downscaling_20190423/NCAR_WRF_50km/analog_regression_3/19510101-19821231/gard_output.analog_regression_3.NCAR_WRF_50km.noresm.hist.19510101-19821231.pcp.nc:pcp

Perhaps there are actually two options here: 1) GARD should error early if an output file already exists, or 2) GARD should clobber the existing output file.
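Option 1 (error early) could be as simple as this Python sketch (illustrative only; GARD's actual output code is Fortran/netCDF, and `open_output` is a hypothetical name):

```python
import os

def open_output(path, clobber=False):
    """Fail before any computation starts if the output file already
    exists, unless the user explicitly asked to overwrite it."""
    if os.path.exists(path) and not clobber:
        raise FileExistsError(
            f"{path} already exists; set clobber=True to overwrite it")
    return open(path, "w")
```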

@gutmann - thoughts on how this should be handled?

Incorporate EOF analysis

Downscaling often uses a PC time series of EOFs computed over a larger spatial domain. It would be nice to let GARD pre-process a given variable (e.g. sea-level pressure) to compute EOFs and corresponding eigenvector weights.
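EOF analysis of a (time, space) anomaly field reduces to an SVD; a Python/NumPy sketch with hypothetical names:

```python
import numpy as np

def eof_analysis(field, n_modes=3):
    """Compute EOFs (spatial patterns) and PC time series of a
    (time, space) field via the SVD of its anomalies."""
    anomalies = field - field.mean(axis=0)
    u, s, vt = np.linalg.svd(anomalies, full_matrices=False)
    pcs = u[:, :n_modes] * s[:n_modes]   # PC time series
    eofs = vt[:n_modes]                  # spatial patterns
    variance_frac = s[:n_modes] ** 2 / (s ** 2).sum()
    return pcs, eofs, variance_frac
```

The leading PC time series could then be passed to GARD as additional predictor variables.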

Add a seasonal locality option

Many downscaling methods operate independently for each month of the year to help preserve the seasonal cycle.

This can be done in GARD by downscaling separately for each month, but that requires substantial pre-processing of the input data (separating monthly data for independent GARD runs and stitching the results back together at the end).

It is relatively simple to add a “day of the year” variable to the input files, and use that (in any of the analog schemes anyway). To be able to wrap around the year it could be two variables (equivalent to “u” and “v” for wind direction with a unit length). That way days that are far away (in day of year) would have to be very similar for all other variables, while days that are close in day of year would be given more flexibility. It is a little awkward, but might help… It probably will not help the pure regression mode though.

Alternatively it would be nice to add a Day-Of-Year localization option in GARD.

For the pure regression mode this would be tricky: it would require GARD to call downscale_point ~12 (or however many) times, subsetting the data appropriately each time, then stitch the results back together internally.

For the analog modes, it should be possible for GARD to simply have an option (in select_analogs) to take time data and enforce that the DoY of any selected analog be within X days. This part would be "easy".
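The wrap-around day-of-year encoding described above (the "u"/"v" trick with unit length) can be sketched as follows; the function names are illustrative:

```python
import math

def doy_uv(day_of_year, days_in_year=365):
    """Encode day-of-year as a point on the unit circle (the "u"/"v"
    components) so distances wrap correctly across the new year."""
    angle = 2 * math.pi * day_of_year / days_in_year
    return math.cos(angle), math.sin(angle)

def doy_distance(d1, d2, days_in_year=365):
    """Chord distance between two days; Dec 31 and Jan 1 come out close."""
    u1, v1 = doy_uv(d1, days_in_year)
    u2, v2 = doy_uv(d2, days_in_year)
    return math.hypot(u1 - u2, v1 - v2)
```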

Break up processing into chunks

To permit larger runs in limited memory it would be nice to be able to specify a chunk size in time (or space) over which GARD would process datasets.

Performing this split in time might be the most obvious (and goes along with splitting up the output files #11 )

Performing the split in space could be more efficient because GARD needs a complete time series for each grid point in the training period regardless of how much time is being processed. Would this end up writing separate files for each subset that need to be recombined later or could it write just the subset it processed to an output file that contains the entire grid?

Ideally both should be options, but do they need to be?

Tradeoffs in implementation complexity and benefit should be assessed briefly.
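The chunk bookkeeping itself is trivial; a Python sketch of the time split (the function name is hypothetical):

```python
def time_chunks(n_times, chunk_size):
    """Yield (start, stop) index pairs covering the full record in
    fixed-size pieces; the last chunk may be shorter."""
    for start in range(0, n_times, chunk_size):
        yield start, min(start + chunk_size, n_times)
```

The harder questions above (memory footprint, recombining outputs) are about what happens inside each chunk, not the iteration itself.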

strip timers from code

Now that basic development is "done", all of the timers that compute how long GARD spent in each part of the code can be stripped out to clean up some sections.

Split output files

Add an option to create monthly, yearly, decadal, etc. output files instead of just a single massive file.

output netcdf debug files when debug set to False

I am running GARD from the SHARP branch with the following config file settings. Even though debug is set to False, several netcdf files that are related to the debug option are still created: obs_preload_pcp.nc, training_preload_APCP_surface.nc, prediction_preload_APCP_surface.nc, obs_preload_t_mean.nc, training_preload_TMP_2maboveground.nc, prediction_preload_TMP_2maboveground.nc.

This seems to only happen when using the pure_regression option, and not the pass_through option. I haven't been able to find a place where the debug option is reset outside of the read_config() module.

&parameters
    output_file     = "/d2/hydrofcst/overtheloop/data/met_fcst/gefs/grid_nc/pure_regression/PNW/20170117/gard_output.pure_regression.20170117.p01.0dy.PNW."

    n_analogs        = 200

    start_date       = "2017-01-16 00:00:00"
    end_date         = "2017-01-16 00:00:00"

    start_train      = "2006-01-01 00:00:00"
    end_train        = "2009-12-31 00:00:00"

    start_transform  = "2006-01-01 00:00:00"
    end_transform    = "2009-12-31 00:00:00"

    pure_regression   = True
    pure_analog       = False
    analog_regression = False
    pass_through = False

    write_coefficients = False
    read_coefficients = True
    coefficients_files = /d2/hydrofcst/overtheloop/data/met_fcst/gefs/grid_nc/pure_regression/PNW/20170117/gard_output.pure_regression.20170113.p01.0dy.PNW.t_mean_coef.nc

    sample_analog = False
    logistic_from_analog_exceedance = False
    logistic_threshold = -9999

    debug = False
    interactive = False
/

&training_parameters
    name = "GEFS training data"
    interpolation_method = 2
    time_indices = 5,6,7,8,9,10,11,12
    agg_method = 0
    time_weights = 3,3,3,3,3,3,3,3
    nvars     = 1
    data_type = "GEFS"
    lat_name  = "latitude"
    lon_name  = "longitude"
    time_name = "time"
    nfiles    = 1461

    input_transformations = 0

    var_names = TMP_2maboveground
    file_list = "/d2/hydrofcst/overtheloop/data/met_fcst/gefs/grid_nc/gard_filelists/gard_filelist.gefs.TMP_2maboveground.mean.20060101-20091231.gefs.txt"

    selected_time = -1
    calendar  = "gregorian"
    timezone_offset = -8
/

&prediction_parameters
    name = "GEFS prediction data"
    interpolation_method = 2
    normalization_method = 2
    time_indices = 5,6,7,8,9,10,11,12
    time_weights = 3,3,3,3,3,3,3,3
    agg_method = 0
    nvars     = 1
    data_type = "GEFS"
    lat_name  = "latitude"
    lon_name  = "longitude"
    time_name = "time"
    nfiles    = 1

    input_transformations = 0

    var_names = TMP_2maboveground
    file_list = "/d2/hydrofcst/overtheloop/data/met_fcst/gefs/grid_nc/gard_filelists/gard_filelist.gefs.TMP_2maboveground.p01.20170117.gefs.txt"
    calendar  = "gregorian"
    timezone_offset = -8
/


&obs_parameters
    name = "SHARP Newman Ensemble"
    nvars     = 1
    nfiles    = 1
    data_type = "obs"
    lat_name  = "latitude"
    lon_name  = "longitude"
    time_name = "time"

    input_transformations = 0
    var_names = t_mean
    file_list = "/d2/hydrofcst/overtheloop/data/met_fcst/gefs/grid_nc/gard_filelists/gard_filelist.obs.t_mean.mean.20091231.PNW.txt"

    calendar  = "gregorian"
/

Pure Analog: when to use days without precipitation

@gutmann and I just had a conversation about the use of (non-)precipitation days in the pure analog case in GARD.

Some things to look into:

  • Normalization - should non-precip days be used in mean/std?
  • analog selection - should non-precip days be used in analog selections? With/without sample_analog?

GH Pages Documentation

Create a gh-pages branch that shows the Doxygen (or other) generated documentation.

Many (most?) routines already have doxygen-style documentation comments, and the doxygen config file exists. This simply needs a branch set up in a subdirectory and pushed.

Do we need to check for recombination, if we have a species tree?

Since recombination affects phylogenetic tree inference from a multiple sequence alignment, if we have a species tree available (independent of the MSA), do we still need to check for recombination in the sequences using GARD to avoid its effect on selection analysis?

minval normalization will be a problem

At the moment, the normalization step subtracts off the minimum value to avoid any negative values. This is fine for precip, for which the minimum is always 0, but for temperature it shifts the location of the mean based on a single outlier minimum temperature.

This is fine for the forecasting case, in which training data are used to normalize the predictors, so there is no discrepancy; but for climate model downscaling this could be very bad.

add interactive namelist option

When debug=False, GARD outputs a progress/status percentage that makes output piped to a file quite difficult to read. @gutmann suggested an interactive namelist option that would allow users to limit the output when logging to a file.

Better checking of time data

We should have a series of checks to verify that the dates/times specified in the GARD namelist will work. These checks should basically cover these cases:

  • training period is available in both the training data and observations data
  • prediction period is available in the prediction data
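Such checks could be sketched as follows (Python; `check_period` is a hypothetical name, not a GARD routine):

```python
from datetime import datetime

def check_period(start, end, times, label):
    """Raise a clear error (instead of exiting silently) when a requested
    period is not covered by a dataset's time axis."""
    if start < min(times) or end > max(times):
        raise ValueError(
            f"{label} period {start}..{end} is outside the "
            f"data range {min(times)}..{max(times)}")
```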

GEFS Select Model Level

The GEFS IO code should use the options%selected_level field when deciding which atmospheric model level to read data from.

This is already implemented in the configuration, and is used in the gcm IO code, it just needs to be implemented in the GEFS code.

Test case

We should develop a (simple) standard test case or cases. Data should be hosted elsewhere, and could even just be located on yellowstone and/or hydro-c1 for now.

Fix normalization for short periods

When the prediction period is only 1 (or fewer than ~100) time steps, the normalization step will not work well for the predictor data.

If possible, the predictor data should be normalized instead using the mean and standard deviation from the training data (assuming it is the "same" and that it has a longer time period)
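A sketch of normalizing a short prediction period with training-period statistics (Python/NumPy, illustrative names):

```python
import numpy as np

def normalize_with_training(predictors, training):
    """Normalize a (possibly very short) prediction-period series using
    the mean and standard deviation of the longer training period."""
    mu = training.mean(axis=0)
    sigma = training.std(axis=0)
    return (predictors - mu) / sigma
```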

Read and apply SCRFs internally

Provide an option to GARD to read in pre-computed SCRF fields so they can be used with, e.g., sample_analog or simply used to apply the error and pop terms internally.

This could go along with writing only the final output, instead of writing (and having to save) the mean, error, and pop fields. That could save a lot of memory and enable longer runs without having to break them up.

Improve output files

Currently, output files have no attributes. At a minimum, output files should have well named variables with attributes. Better, we should probably combine output variables into a single file. This should include lat / lon / time variables.

Get fill value from obs files

The current behavior of GARD is to use a namelist option to specify the fill value in the input observations netCDF data. Instead, we should just check for the _FillValue attribute or a mask variable to determine the spatial domain for GARD.
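Deriving the domain mask from the data itself could look like this NumPy sketch (here the `fill_value` argument stands in for the `_FillValue` attribute read from the file; `valid_domain` is a hypothetical name):

```python
import numpy as np

def valid_domain(obs, fill_value=None):
    """Mark a grid cell active if it ever holds a finite, non-fill value.
    fill_value would normally come from the variable's _FillValue
    attribute in the observations netCDF file."""
    invalid = ~np.isfinite(obs)
    if fill_value is not None:
        invalid |= (obs == fill_value)
    return ~invalid.all(axis=0)  # True where at least one valid time step
```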
