
cfr's People

Contributors

commonclimate, fzhu2e, zilum


Forkers

zilum, kindredfff

cfr's Issues

name 'PyVSL' is not defined

When I run the step:
    mdl.calibrate()
it raises an error:
    NameError: name 'PyVSL' is not defined
I tried to install PyVSL, but it does not seem to work.
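
A quick sanity check (a hedged suggestion, not an official cfr diagnostic; PyVSL is a separate package that cfr's VS-Lite proxy system model wraps):

    # If this import fails, PyVSL is not installed in the environment that
    # runs mdl.calibrate(); note that building PyVSL may require a Fortran
    # compiler (e.g. gfortran).
    import PyVSL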

For reconstruction jobs, use an API to load the data instead of a pickle file.

The load_proxy method of the ReconJob class requires a pickle file (which I'm guessing has fixed keys). It would be nice to have an alternative that specifies the columns explicitly, as was done for ProxyDatabase().from_df. Example:

pdb = cfr.ProxyDatabase().from_df(
    df, pid_column='dsname', lat_column='lat', lon_column='lon',
    time_column='timeval', value_column='val', proxy_type_column='proxyobs',
    archive_type_column='archive', value_name_column='varname',
    value_unit_column='varunits')
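
In the meantime, a minimal workaround sketch, assuming the job stores its database in a proxydb attribute (the attribute name is an assumption about what load_proxy populates):

    import cfr
    import pandas as pd

    df = pd.read_csv('proxies.csv')  # hypothetical input table
    pdb = cfr.ProxyDatabase().from_df(
        df, pid_column='dsname', lat_column='lat', lon_column='lon',
        time_column='timeval', value_column='val')

    job = cfr.ReconJob()
    job.proxydb = pdb  # assumed attribute name; bypasses the pickle file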

Re-use pyleoclim infrastructure

Currently, cfr only uses pyleoclim for spectral analysis. However, the ProxyRecord class is essentially a pyleoclim GeoSeries; only "seasonality" is missing.

Making ProxyRecord a child of GeoSeries would enable Pyleoclim functionalities, particularly:

  • pandas conversion with to_pandas()/from_pandas()
  • time unit conversions
  • standardize()
  • center()
  • bin()
  • and maybe more
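
A minimal sketch of what the inheritance could look like, assuming pyleoclim's GeoSeries constructor can be passed through unchanged; the seasonality attribute is the one addition identified above:

    import pyleoclim as pyleo

    class ProxyRecord(pyleo.GeoSeries):
        """Hypothetical subclass: inherits standardize(), center(), bin(),
        to_pandas()/from_pandas(), etc., and adds the missing attribute."""
        def __init__(self, *args, seasonality=None, **kwargs):
            super().__init__(*args, **kwargs)
            self.seasonality = seasonality  # e.g. [6, 7, 8] for JJA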

Warning on z-score calculation for proxy composites

The proxy composite z-scores really only make sense within a proxy type (not even an archive type). If a user tries to calculate a z-score across proxy types, cfr should throw a warning.

The problem is that proxies may have a positive or negative relationship with their common variable (e.g., coral d18O and coral Sr/Ca vs. temperature), so some may need to be flipped prior to calculating the z-score. This is done automatically when calibrating against instrumental records.

Preferred solution: the LiPD files have an interpretation field indicating the direction of the relationship. We could use it to figure out automatically when to flip the axis, but that would require changes to the API used to create a ProxyDatabase, and existing pickle files would no longer be valid.
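
A minimal sketch of the warning-and-flip logic, assuming equal-length series on a common time axis (composite_zscore is a hypothetical helper, not part of the current cfr API):

    import warnings
    import numpy as np

    def composite_zscore(values, ptypes, target=None):
        """Hypothetical helper: z-score series before compositing."""
        # Warn when the composite mixes proxy types.
        if len(set(ptypes)) > 1:
            warnings.warn('z-scores computed across multiple proxy types; '
                          'records may relate to the target with opposite signs')
        zs = []
        for v in values:
            z = (v - np.nanmean(v)) / np.nanstd(v)
            # Flip records that correlate negatively with the instrumental
            # target (e.g. coral d18O vs. temperature).
            if target is not None and np.corrcoef(v, target)[0, 1] < 0:
                z = -z
            zs.append(z)
        return np.nanmean(zs, axis=0)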

Looping @CommonClimate into this discussion.

Remove assumption of temperature

In several places, the code assumes that users only want to reconstruct temperature and that this variable is called 'tas' (e.g. in prep_graphem). This is an unnecessarily restrictive assumption and may lead to bugs if people try to reconstruct any other field or use other variable names.

Since cfr stands for "climate field reconstruction", I suggest calling the target variable "field" in graphem-related functions. It may also be good to check that the LMR part of the code does not assume too much either.
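
A hedged illustration of the proposed change; the signature and the job.prior lookup are assumptions for the sketch, not the existing cfr API:

    def prep_graphem(job, field_name='tas'):
        """Select the target variable by name instead of hard-coding
        temperature; the default preserves current behavior."""
        field = job.prior[field_name]  # assumes prior is keyed by variable name
        return field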

Suggested improvements for documentation

Hi @fzhu2e,
I was going over the doc with Shreya (a USC undergrad who will be trying to apply the code to PAGES2k and maybe CoralHydro2k), and I noticed a couple of things:

  • many of the pages should start with a brief introduction linking to relevant concepts (e.g. papers), so newcomers like Shreya or CoralHydro2k collaborators can be clear on the terminology.
  • what is called "Climate" is really model priors (at least so far). Does your concept of Climate also encompass gridded instrumental data? If yes, the examples should show that; if not, rename that category "ClimatePrior" (in the code and in the doc).
  • regardless, we need to think more broadly about how to pull in instrumental data. Lazy loading of remotely hosted netCDF files via xarray could be a nice solution (see the sketch after this list), though I don't know whether it will work for all data products. Maybe @khider is aware of a catalog for those, from her MINT work?
  • the LMR workflow is nice, but it ends on some fairly obscure diagnostics that few people outside the DA community will understand. At a minimum, it should include some ways to visualize the posterior (e.g. a GMST ensemble timeseries, a map of Little Ice Age cooling).
  • it is also important to show how to compare prior and posterior, since that is a key part of working with DA that too few people presently do. Having the code capabilities to do it, and an example of how it's done, will go a very long way. A good example of this is Singh et al. (2019), Fig. 4. In general, showing how spatial relations in the prior (e.g. covariance with a field, or across two fields) are affected by assimilating proxy information would be an excellent idea.
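
On the lazy-loading point above, a sketch of what this could look like with xarray (the OPeNDAP endpoint and variable name are placeholders; chunked lazy access also requires dask, and not every data product exposes such an endpoint):

    import xarray as xr

    url = 'https://server.example/thredds/dodsC/product.nc'  # hypothetical endpoint
    ds = xr.open_dataset(url, chunks={'time': 120})  # lazy: nothing downloaded yet
    clim = ds['tas'].sel(time=slice('1900', '2000')).mean('time')
    clim.load()  # only now is the required subset actually fetched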

Just a few ideas to operationalize this great package. Keep your audience in mind!

find and remove proxy duplicates

The GraphEM bug with a singular matrix illustrated the perils of having duplicates in the proxy matrix. I originally thought it would only be an issue for GraphEM, and should therefore be dealt with within prep_graphem(), but I now believe it needs to be done earlier in the workflow.

Here is my proposal:

  • create a new method in the ProxyDatabase class called find_duplicates(), governed by a parameter called r_thresh (default = 0.9). Within that function, compute R = np.triu(np.corrcoef(proxy.T), k=1) (where proxy is the proxy matrix) and find the indices/labels of the record pairs for which R > r_thresh (a minimal sketch follows this list).
  • offer the option to visualize those potential duplicates, by repurposing ProxyRecord.plot() and plotting the two similar series in the same Axes object (different colors and/or line styles, whatever works best to tell them apart).
  • ask the user which ones they want to remove, and add those indices/labels to a list.
  • at the end, bundle those records into a "db_to_remove" instance of ProxyDatabase, so the user can subtract those proxies from the original database using the "-" syntax (don't do it for them, though; this must be an explicit part of the workflow so they remember that they did it).
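
A minimal sketch of the detection step, assuming a complete (gap-free) proxy matrix of shape (time x record) and a list of record labels:

    import numpy as np

    def find_duplicates(proxy, pids, r_thresh=0.9):
        """Proposed ProxyDatabase.find_duplicates(), sketched standalone."""
        R = np.triu(np.corrcoef(proxy.T), k=1)  # record-by-record correlations
        i_idx, j_idx = np.where(R > r_thresh)   # candidate duplicate pairs
        return [(pids[i], pids[j], R[i, j]) for i, j in zip(i_idx, j_idx)]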

I believe this will be the cleanest and most transparent approach, as users will have to make careful, explicit decisions. Now that I think of it, we had to do this a lot as part of PAGES 2k ca. 2015-2016, because several groups had included the same proxy series, or several slightly different versions of the same proxy. I bet this will be helpful for CoralHydro2k as well, and it will come in handy when merging two databases that have potential duplicates. So overall a very useful feature that will serve both pseudo- and real-proxy recons.

GraphEM improvements

Now that we've confirmed that the cfr implementation of GraphEM can run on non-pathological cases, it needs to be upgraded to the next level:

Cross-Validation

The choice of regression model in GraphEM (the graph) is still very unsatisfactory: whether it is the cutoff radius for a neighborhood graph or the target sparsities of a graphical LASSO ("glasso") graph, the only way to choose it now is by trial and error, which is unscientific, error-prone, and, frankly, a little embarrassing. We can do a lot better than that with cross-validation.

  • implement k-fold, block-style cross-validation over the instrumental period (this might entail retooling verif_stats to look at more than the average MSE over the field, for instance). Use the 1-sigma rule, with k=5 by default (sketched after this list).
  • cross-check against pre-instrumental data in pseudoproxy experiments.
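
A sketch of the two pieces, contiguous folds and the 1-sigma rule; the names and array shapes are assumptions, not existing cfr code:

    import numpy as np

    def block_kfold_indices(n_time, k=5):
        # Contiguous blocks, so that autocorrelated years are held out together.
        edges = np.linspace(0, n_time, k + 1).astype(int)
        return [np.arange(edges[i], edges[i + 1]) for i in range(k)]

    def one_sigma_choice(params, cv_mse):
        # cv_mse has shape (n_params, k); params ordered simplest -> most complex.
        cv_mse = np.asarray(cv_mse)
        mean = cv_mse.mean(axis=1)
        se = cv_mse.std(axis=1, ddof=1) / np.sqrt(cv_mse.shape[1])
        best = mean.argmin()
        # 1-sigma rule: pick the simplest model whose mean CV error is
        # within one standard error of the minimum.
        ok = np.where(mean <= mean[best] + se[best])[0]
        return params[ok[0]]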

glasso capabilities

Neighborhood graphs are a quick and dirty way to get a reconstruction, but they underuse the available information. If enough data are available for calibration, glasso can do much better at extracting structure and capturing spatial dependencies. However, glasso is in need of the following updates:

  • defining reasonable sparsity levels for cross-validation (see the scikit-learn sketch after this list).
  • allow hybrid graphs where the climate-field graph is obtained through glasso but the proxy-field graph can be neighborhood-based.
  • compare gains over neighborhood graphs on standard test cases.
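
For the sparsity selection, scikit-learn's cross-validated glasso could serve as a reference implementation (this uses sklearn rather than cfr's own solver, and the data below are random stand-ins):

    import numpy as np
    from sklearn.covariance import GraphicalLassoCV

    rng = np.random.default_rng(0)
    X = rng.standard_normal((120, 50))       # stand-in: (years x grid points), standardized
    model = GraphicalLassoCV(cv=5).fit(X)
    graph = np.abs(model.precision_) > 1e-8  # nonzero partial correlations = graph edges
    print('sparsity:', 1 - graph.mean())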

temperature assumption

As in #2, this code was written with the assumption that temperature is the only field of interest. The math stays the same, so changing the nomenclature won't change any numerical behavior, but I will still try to:

  • replace all mentions of "temperature" with "field"

Next level (optional)

  • explore the use of skggm, which uses the scikit-learn API to fit Gaussian graphical models (GGMs).
  • explore the family-wise error rate method to help choose a sensible graph for the climate part of the covariance matrix, and maybe the climate-proxy part as well (need Dominique's input).
