
cfr's People

Contributors

commonclimate, fzhu2e, zilum


Forkers

zilum, kindredfff

cfr's Issues

name 'PyVSL' is not defined

When I run the step:
    mdl.calibrate()
it raises an error:
    NameError: name 'PyVSL' is not defined
I tried to install PyVSL, but it does not seem to work.
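
A quick sanity check (a hedged suggestion, not an official cfr diagnostic; PyVSL is a separate package that cfr's VS-Lite proxy system model wraps):

    # If this import fails, PyVSL is not installed in the environment that
    # runs mdl.calibrate(); note that building PyVSL may require a Fortran
    # compiler (e.g. gfortran).
    import PyVSL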

For reconstruction jobs, use an API to load the data instead of a pickle file.

The load_proxy method of the ReconJob class requires a pickle file (which I'm guessing has fixed keys). It would be nice to have an alternative that specifies the columns explicitly, as was done for ProxyDatabase().from_df. Example:

pdb = cfr.ProxyDatabase().from_df(
    df, pid_column='dsname', lat_column='lat', lon_column='lon',
    time_column='timeval', value_column='val', proxy_type_column='proxyobs',
    archive_type_column='archive', value_name_column='varname',
    value_unit_column='varunits')
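
In the meantime, a minimal workaround sketch, assuming the job stores its database in a proxydb attribute (the attribute name is an assumption about what load_proxy populates):

    import cfr
    import pandas as pd

    df = pd.read_csv('proxies.csv')  # hypothetical input table
    pdb = cfr.ProxyDatabase().from_df(
        df, pid_column='dsname', lat_column='lat', lon_column='lon',
        time_column='timeval', value_column='val')

    job = cfr.ReconJob()
    job.proxydb = pdb  # assumed attribute name; bypasses the pickle file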

Re-use pyleoclim infrastructure

Currently, cfr only uses pyleoclim for spectral analysis. However, the ProxyRecord class is essentially a pyleoclim GeoSeries; only "seasonality" is missing.

Making ProxyRecord a child of GeoSeries would enable Pyleoclim functionalities, particularly:

  • pandas conversion with to_pandas()/from_pandas()
  • time unit conversions
  • standardize()
  • center()
  • bin()
  • and maybe more
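
A minimal sketch of what the inheritance could look like, assuming pyleoclim's GeoSeries constructor can be passed through unchanged; the seasonality attribute is the one addition identified above:

    import pyleoclim as pyleo

    class ProxyRecord(pyleo.GeoSeries):
        """Hypothetical subclass: inherits standardize(), center(), bin(),
        to_pandas()/from_pandas(), etc., and adds the missing attribute."""
        def __init__(self, *args, seasonality=None, **kwargs):
            super().__init__(*args, **kwargs)
            self.seasonality = seasonality  # e.g. [6, 7, 8] for JJA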

Warning on z-score calculation for proxy composites

The proxy composite z-scores really only make sense within a proxy type (not even an archive type). If a user tries to calculate a z-score across proxy types, cfr should throw a warning.

The problem is that proxies may have a positive or negative relationship with their common variable (e.g., coral d18O and coral Sr/Ca vs. temperature), so some may need to be flipped prior to calculating the z-score. This is done automatically when calibrating against instrumental records.

Preferred solution: the LiPD files have an interpretation field indicating the direction of the relationship. We could use it to figure out automatically when to flip the axis, but that would require changes to the API used to create a ProxyDatabase, and existing pickle files would no longer be valid.
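
A minimal sketch of the warning-and-flip logic, assuming equal-length series on a common time axis (composite_zscore is a hypothetical helper, not part of the current cfr API):

    import warnings
    import numpy as np

    def composite_zscore(values, ptypes, target=None):
        """Hypothetical helper: z-score series before compositing."""
        # Warn when the composite mixes proxy types.
        if len(set(ptypes)) > 1:
            warnings.warn('z-scores computed across multiple proxy types; '
                          'records may relate to the target with opposite signs')
        zs = []
        for v in values:
            z = (v - np.nanmean(v)) / np.nanstd(v)
            # Flip records that correlate negatively with the instrumental
            # target (e.g. coral d18O vs. temperature).
            if target is not None and np.corrcoef(v, target)[0, 1] < 0:
                z = -z
            zs.append(z)
        return np.nanmean(zs, axis=0)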

Looping @CommonClimate into this discussion.

Remove assumption of temperature

In several places, the code assumes that users only want to reconstruct temperature and that this variable is called 'tas' (e.g. in prep_graphem). This is an unnecessarily restrictive assumption and may lead to bugs if people try to reconstruct any other field or use other variable names.

Since cfr stands for "climate field reconstruction", I suggest calling the target variable "field" in graphem-related functions. It may also be good to check that the LMR part of the code does not assume too much either.
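
A hedged illustration of the proposed change; the signature and the job.prior lookup are assumptions for the sketch, not the existing cfr API:

    def prep_graphem(job, field_name='tas'):
        """Select the target variable by name instead of hard-coding
        temperature; the default preserves current behavior."""
        field = job.prior[field_name]  # assumes prior is keyed by variable name
        return field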

Suggested improvements for documentation

Hi @fzhu2e,
I was going over the doc with Shreya (a USC undergrad who will be trying to apply the code to PAGES2k and maybe CoralHydro2k), and I noticed a couple of things:

  • many of the pages should start with a brief introduction linking to relevant concepts (e.g. papers), so newcomers like Shreya or CoralHydro2k collaborators can be clear on the terminology.
  • what is called "Climate" is really model priors (at least so far). Does your concept of Climate also encompass gridded instrumental data? If yes, the examples should show that; if not, rename that category "ClimatePrior" (in the code and in the doc).
  • regardless, we need to think more broadly about how to pull in instrumental data. Lazy loading of remotely hosted netCDF files via xarray could be a nice solution (see the sketch after this list), though I don't know whether it will work for all data products. Maybe @khider is aware of a catalog for those, from her MINT work?
  • the LMR workflow is nice, but it ends on some fairly obscure diagnostics that few people outside the DA community will understand. At a minimum, it should include some ways to visualize the posterior (e.g. a GMST ensemble timeseries, a map of Little Ice Age cooling).
  • it is also important to show how to compare prior and posterior, since that is a key part of working with DA that too few people presently do. Having the code capabilities to do it, and an example of how it's done, will go a very long way. A good example of this is Singh et al. (2019), Fig. 4. In general, showing how spatial relations in the prior (e.g. covariance with a field, or across two fields) are affected by assimilating proxy information would be an excellent idea.
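
On the lazy-loading point above, a sketch of what this could look like with xarray (the OPeNDAP endpoint and variable name are placeholders; chunked lazy access also requires dask, and not every data product exposes such an endpoint):

    import xarray as xr

    url = 'https://server.example/thredds/dodsC/product.nc'  # hypothetical endpoint
    ds = xr.open_dataset(url, chunks={'time': 120})  # lazy: nothing downloaded yet
    clim = ds['tas'].sel(time=slice('1900', '2000')).mean('time')
    clim.load()  # only now is the required subset actually fetched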

Just a few ideas to operationalize this great package. Keep your audience in mind!

find and remove proxy duplicates

The GraphEM bug with a singular matrix illustrated the perils of having duplicates in the proxy matrix. I originally thought it would only be an issue for GraphEM, and should therefore be dealt with within prep_graphem(), but I now believe it needs to be done earlier in the workflow.

Here is my proposal:

  • create a new method in the ProxyDatabase class called find_duplicates(), governed by a parameter called r_thresh (default = 0.9). Within that function, compute R = np.triu(np.corrcoef(proxy.T), k=1) (where proxy is the proxy matrix) and find the indices/labels of the record pairs for which R > r_thresh (a minimal sketch follows this list).
  • offer the option to visualize those potential duplicates, by repurposing ProxyRecord.plot() and plotting the two similar series in the same Axes object (different colors and/or line styles, whatever works best to tell them apart).
  • ask the user which ones they want to remove, and add those indices/labels to a list.
  • at the end, bundle those records into a "db_to_remove" instance of ProxyDatabase, so the user can subtract those proxies from the original database using the "-" syntax (don't do it for them, though; this must be an explicit part of the workflow so they remember that they did it).
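
A minimal sketch of the detection step, assuming a complete (gap-free) proxy matrix of shape (time x record) and a list of record labels:

    import numpy as np

    def find_duplicates(proxy, pids, r_thresh=0.9):
        """Proposed ProxyDatabase.find_duplicates(), sketched standalone."""
        R = np.triu(np.corrcoef(proxy.T), k=1)  # record-by-record correlations
        i_idx, j_idx = np.where(R > r_thresh)   # candidate duplicate pairs
        return [(pids[i], pids[j], R[i, j]) for i, j in zip(i_idx, j_idx)]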

I believe this will be the cleanest and most transparent approach, as users will have to make careful, explicit decisions. Now that I think of it, we had to do this a lot as part of PAGES 2k ca. 2015-2016, because several groups had included the same proxy series, or several slightly different versions of the same proxy. I bet this will be helpful for CoralHydro2k as well, and it will come in handy when merging two databases that have potential duplicates. So overall a very useful feature that will serve both pseudo- and real-proxy recons.

GraphEM improvements

Now that we've confirmed that the cfr implementation of GraphEM can run on non-pathological cases, it needs to be upgraded to the next level:

Cross-Validation

The choice of regression model in GraphEM (the graph) is still very unsatisfactory: whether it is the cutoff radius for a neighborhood graph or the target sparsities of a graphical LASSO ("glasso") graph, the only way to choose it now is by trial and error, which is unscientific, error-prone, and, frankly, a little embarrassing. We can do a lot better than that with cross-validation.

  • implement k-fold, block-style cross-validation over the instrumental period (this might entail retooling verif_stats to look at more than the average MSE over the field, for instance). Use the 1-sigma rule, with k=5 by default (sketched after this list).
  • cross-check against pre-instrumental data in pseudoproxy experiments.
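
A sketch of the two pieces, contiguous folds and the 1-sigma rule; the names and array shapes are assumptions, not existing cfr code:

    import numpy as np

    def block_kfold_indices(n_time, k=5):
        # Contiguous blocks, so that autocorrelated years are held out together.
        edges = np.linspace(0, n_time, k + 1).astype(int)
        return [np.arange(edges[i], edges[i + 1]) for i in range(k)]

    def one_sigma_choice(params, cv_mse):
        # cv_mse has shape (n_params, k); params ordered simplest -> most complex.
        cv_mse = np.asarray(cv_mse)
        mean = cv_mse.mean(axis=1)
        se = cv_mse.std(axis=1, ddof=1) / np.sqrt(cv_mse.shape[1])
        best = mean.argmin()
        # 1-sigma rule: pick the simplest model whose mean CV error is
        # within one standard error of the minimum.
        ok = np.where(mean <= mean[best] + se[best])[0]
        return params[ok[0]]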

glasso capabilities

Neighborhood graphs are a quick and dirty way to get a reconstruction, but they underuse the available information. If enough data are available for calibration, glasso can do much better at extracting structure and capturing spatial dependencies. However, glasso is in need of the following updates:

  • defining reasonable sparsity levels for cross-validation (see the scikit-learn sketch after this list).
  • allow hybrid graphs where the climate-field graph is obtained through glasso but the proxy-field graph can be neighborhood-based.
  • compare gains over neighborhood graphs on standard test cases.
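
For the sparsity selection, scikit-learn's cross-validated glasso could serve as a reference implementation (this uses sklearn rather than cfr's own solver, and the data below are random stand-ins):

    import numpy as np
    from sklearn.covariance import GraphicalLassoCV

    rng = np.random.default_rng(0)
    X = rng.standard_normal((120, 50))       # stand-in: (years x grid points), standardized
    model = GraphicalLassoCV(cv=5).fit(X)
    graph = np.abs(model.precision_) > 1e-8  # nonzero partial correlations = graph edges
    print('sparsity:', 1 - graph.mean())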

temperature assumption

As in #2, this code was written with the assumption that temperature is the only field of interest. The math stays the same, so changing the nomenclature won't change any numerical behavior, but I will still try to:

  • replace all mentions of "temperature" with "field"

Next level (optional)

  • explore the use of skggm, which uses the scikit-learn API to fit Gaussian graphical models (GGMs).
  • explore the family-wise error rate method to help choose a sensible graph for the climate part of the covariance matrix, and maybe the climate-proxy part as well (need Dominique's input).
