spencerahill / aospy Goto Github PK

Python package for automated analysis and management of gridded climate data

License: Apache License 2.0

Python 77.51% Jupyter Notebook 22.49%

aospy's Introduction

aospy: automated climate data analysis and management

aospy is a Python-based tool for automating computations involving gridded climate data and the management of the results of those computations. Use it to accelerate your science by automating your climate data workflow.

Please visit our documentation for more detailed information: what it does, how to install, how to use, how it works, etc.: http://aospy.readthedocs.io

Quickstart

Install via conda :

conda install -c conda-forge aospy

And that's it! We're also available via pip: pip install aospy. Then checkout the official documentation for instructions on how to get started.

Troubleshooting

Questions of any kind are welcome and best placed as Issues on our official Github repo.

Please don't hesitate to ask a question, especially if you're new to the package and/or Python! We are eager to help people get started using aospy.

Copyright and license

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

aospy's People

Contributors

Stargazers

Watchers

Forkers

spencerkclark kaituozhe528 micahkim23 dacoex petertfs pslota rvalenzuelar liumenglife wblumberg haydenbetts sourcery-ai-bot

aospy's Issues

Backend for to_netcdf

@spencerkclark, could you tell (remind?) me of the motivation behind using the engine=scipy specification for our netcdf I/O in calc.py? Just curious.

Overflow in time averaging

Due to the fact that within xray.DataArray objects, timedeltas are always stored as timedelta64[ns] (regardless of whether you instantiate the DataArray with timedelta64[D] or timedelta64[Y] objects), they tend to be very large numbers. In particular, when grouping by years and summing, these numbers are on the order of 10^16. Even for relatively small numbered-data (like temperature, which is order 100) this can lead to overflow issues (which go undetected until you look at the results of a computation; you end up getting negative values for weighted averages).

Here's a minimal working example of the problem:

In [1]: import xray
In [2]: import numpy as np
In [3]: year = xray.DataArray(np.timedelta64(1,'Y'))
In [4]: print(year)
<xray.DataArray ()>
numpy.timedelta64(31556952000000000,'ns')

# For small numbers this works.
In [5]: year * 100
Out[5]:
<xray.DataArray ()>
numpy.timedelta64(3155695200000000000,'ns')

# For larger numbers things turn negative.  This is not good, because if you 
# were taking the annual average at a gridbox where the average value was 300
# you would have a number like this in the numerator within _to_yearly_ts().
In [6]: year * 300
Out[6]:
<xray.DataArray ()>
numpy.timedelta64(-8979658473709551616,'ns')

An easy fix is to simply scale and convert the dt array such that instead of being comprised of timedelta64[ns] objects, it is instead comprised of float objects with the value represented by the timedelta, but in number of days rather than nanoseconds. The main issue I see with this solution is that if there were ever some place where we needed a particular feature of a timedelta (other than its number), then we wouldn't have it, but since the values would still be in a xray.DataArray they would still be linked with the time coordinate.

Is there any need for the functionality of the timedeltas once they exist, or can we just convert dt to being a DataArray of float objects?

A solution within _to_yearly_ts could look like

def _to_yearly_ts(self, arr, dt):
        """Average a sub-yearly time-series over each year."""
        if isinstance(dt, int):
            dt_by_year = len(arr.groupby('time.year'))
        else:
            # Convert from ns to days (prevent overflow)
            dt.values = dt.values.astype('timedelta64[D]').astype('float')
            dt_by_year = dt.groupby('time.year').sum('time')  
        arr = arr*dt
        return arr.groupby('time.year').sum('time') / dt_by_year

but perhaps we could just change self.dt at the very beginning when we bring it in (say in _create_input_data_obj?).

`FigLayout` class to save panel spacing and arrangement

Would include figure size, row width, column size, number of rows, columns, ax limits, colorbar location and limits, map projection type, etc. Can be passed to plotting routines. Would enable easy replicating of multi-panel plots.

Could be used to create default layouts also, i.e. one per (# row x # column) arrangement. (I don't like the spacing matplotlib does by default, which is why this is worth bothering with at all).

Extend fix for years less than 1678 for non-"days since" time units

Currently our workaround for dates with years less than 1678 makes the assumption that the original units of the time dimension in the read-in file were "days since ...". As part of the workaround we hard-code the units to be "days since 1900-01-01 00:00:00" (see https://github.com/spencerahill/aospy/blob/develop/aospy/calc.py#L417-L418) and then decode the CF properties. This causes problems for high-frequency output, because while the units of the time dimension in the original file might be "hours since ...", they get changed to "days since 1900-01-01 00:00:00". This means that xarray interprets the time interval between outputs as days (rather than hours in this example); e.g. 2920 3-hourly outputs get interpreted to be 24 years (rather than one year).

A straightforward fix would be to allow the user to specify another argument when setting up a calc to specify the time units; that being said I'm not sure how keen we are in adding more arguments to that interface.
It also might be possible to do this automatically (by searching for one of ['hours', 'days'] in the units string of the original file). I would lean towards this solution, with sufficient testing.

For the time being (since I don't switch back and forth between doing high-frequency and low-frequency analysis too often) it's fine for me to correct this by changing the hard-coding (e.g. change the time units to "hours since 1900-01-01 00:00:00"), but I definitely think this is something that needs to be addressed eventually (although hopefully at some point this gets fixed upstream, so we don't have to use a workaround to begin with).

utils.monthly_mean_ts includes unwanted months if # years > 1

E.g. if only JAS desired and two years of input data, then resulting time array will have 15 values: July through December of year 1 and January through September of year 2. Intended result is 6 values, July-September of each year.

Looks easy enough to fix; working on it now.

Allow user to specify analysis period as a range of dates, rather than years

Currently when constructing a Run object the user must specify a set of years that the model is run through as well as the analysis period. For example:

am2_control = Run(
    name='am2_control',
    description=(
        'Preindustrial control simulation.'
        ),
    direc_nc=('/path/to/files'),
    nc_dur=100,
    nc_start_yr=1,
    nc_end_yr=100,
    default_yr_range=(21,100)
)

This works when the analysis period and Run align nicely with years, but prevents any finer control. Perhaps a solution that would provide greater flexibility would be to allow the user to specify the default_time_range. For the above example it would be something along the lines of:

from netCDF4.netcdftime import datetime

am2_control = Run(
    name='am2_control',
    description=(
        'Preindustrial control simulation.'
        ),
    direc_nc=('/path/to/files'),
    nc_dur=100,
    nc_start_yr=1,
    nc_end_yr=100,
    default_time_range=(datetime(21, 1, 1), datetime(100, 12, 31))
)

Here instead of numpy.datetime64 objects we can use netCDF4.netcdftime.datetime objects which allow for years less than 1678. For idealized cases with no natural calendar (which aospy currently treats as having a 360_day calendar) one could simply compute the start and end dates using the netCDF4.netcdftime.num2date function based on the number of days since the start of the run.

AM3 lon_bounds, lat_bounds not being read from grid files

(Most recent commit is cda8fe7, although this bug no doubt stems from changes in a prior commit)

One of the grid files is: /archive/Spencer.Hill/am3/am3clim_hurrell/gfdl.ncrc2-intel-prod-openmp/pp/atmos/atmos.static.nc. This netCDF file has lat_bnds and lon_bnds variables, and those strings are in the list of strings to check for, but for some reason they aren't being found by Model._get_grid_attr.

Abstractify methods for finding data files saved on disk

(Below is copied from @spencerkclark 's comment on #31 )

While currently within Calc the two methods accomplish the same tasks, in an abstract sense the current ..._one_dir and ..._gfdl read-in methods are actually quite different:

..._one_dir requires the user to map every variable to a file name explicitly. If the variable does not appear in this map, aospy will not even attempt to look for it.
..._gfdl is an implicit system. The mapping is coded into the method which looks for the files within the post-processing file structure. The user is not required a priori to specify which variables are in which files, and thus aospy is allowed to look for variables that may not exist.

I would argue that the explicit read-in method is the most general way of doing things. With enough information, one could automatically generate an explicit file map from an implicit generator. In addition, there is nothing that says you couldn't relax the current single directory constraint and just map each variable (within a particular time frequency) to a full file path.

To continue to support implicit read-in methods (for very structured output data, like ..._gfdl) you could require that a user create some object that implements an interface (call the interface FileMapGenerator?) to include methods to generate a map to files for a particular variable when given the intvl_in, variable name, data_in_dur etc. as arguments. The source code for these objects could be stored in a user's aospy_user directory.

Within a Run object one could then have a single argument for the file read-in method. The user could pass either the explicit dictionary mapping or they could pass an object that implements the FileMapGenerator interface. Within Calc, when reading in the files, you could have some simple logic that would be along the lines of: "if an explicit map is provided use the map; if not, use information about which variable you are looking for, and the interval in etc. and pass those as arguments to the generator, which would return an explicit map for just that variable." Using an interface would ensure that the explicit file map generated would always have the same structure (so that it could be used seamlessly within Calc).

Calc not working for AM3 due to 2-D lat/lon bound arrays

Results of print(ds) added to Calc._add_grid_attributes immediately before return(ds):

<xray.Dataset>
Dimensions:     (bnds: 2, lat: 90, level: 23, lon: 144, pfull: 48, phalf: 49, time: 360)
Coordinates:
  * lat         (lat) float64 -89.0 -87.0 -85.0 -83.0 -81.0 -79.0 -77.0 ...
  * lon         (lon) float64 1.25 3.75 6.25 8.75 11.25 13.75 16.25 18.75 ...
  * bnds        (bnds) int64 0 1
  * time        (time) datetime64[ns] 1981-01-16T12:00:00 1981-02-15 ...
    lat_bounds  (lat, bnds) float64 -90.0 -88.0 -88.0 -86.0 -86.0 -84.0 ...
  * phalf       (phalf) float64 0.01 0.02697 0.05171 0.08895 0.1425 0.2207 ...
    sfc_area    (lon, bnds, lat) float64 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    bk          (phalf) float64 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
  * pfull       (pfull) float64 0.01711 0.03801 0.06866 0.1136 0.1788 0.2744 ...
    land_mask   (lat, lon) float64 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ...
    lon_bounds  (lon, bnds) float64 0.0 2.5 2.5 5.0 5.0 7.5 7.5 10.0 10.0 ...
    zsurf       (lat, lon) float64 2.851e+03 2.851e+03 2.851e+03 2.851e+03 ...
  * level       (level) float32 100000.0 92500.0 85000.0 77500.0 70000.0 ...
    pk          (phalf) float64 1.0 2.697 5.171 8.895 14.25 22.07 33.61 ...
Data variables:
    t_surf      (time, lat, lon) float64 243.8 243.7 243.7 243.6 243.5 243.4 ...
    average_DT  (time) timedelta64[ns] 31 days 28 days 31 days 30 days ...
    lat_bnds    (time, lat, bnds) float64 -90.0 -88.0 -88.0 -86.0 -86.0 ...
    lon_bnds    (time, lon, bnds) float64 0.0 2.5 2.5 5.0 5.0 7.5 7.5 10.0 ...
Attributes:

Note the presence of both lat_bnds and lon_bnds as variables and lat_bounds and lon_bounds as coordinates, and that the former pair incorrectly have time as their first coordinate.

This ultimately yields an error in the regional calculations:

/home/Spencer.Hill/py/main.py in <module>()
     43
     44 if __name__ == '__main__':
---> 45     calcs = aospy_user.main(mp)

/home/Spencer.Hill/py/aospy_user/aospy_user/main.pyc in main(main_params)
    176         return
    177     param_combos = cs.create_params_all_calcs()
--> 178     calcs = cs.create_calcs(param_combos, exec_calcs=True, print_table=True)
    179     return calcs

/home/Spencer.Hill/py/aospy_user/aospy_user/main.pyc in create_calcs(self, param_combos, exec_calcs, print_table)
    142                 if exec_calcs:
    143                     try:
--> 144                         calc.compute()
    145                     except:
    146                         raise

/home/Spencer.Hill/py/aospy/aospy/calc.pyc in compute(self)
    695                     data = full_ts
    696                 if 'reg' in specs:
--> 697                     reduced.update({reduc: self.region_calcs(data, func)})
    698                 else:
    699                     reduced.update({reduc: self._time_reduce(data, func)})

/home/Spencer.Hill/py/aospy/aospy/calc.pyc in region_calcs(self, arr, func, n)
    642             else:
    643                 method = getattr(reg, func)
--> 644                 data_out = method(arr, self.model[n])
    645             reg_dat.update({reg.name: data_out})
    646         return reg_dat

/home/Spencer.Hill/py/aospy/aospy/region.pyc in av(self, data, model)
     74     def av(self, data, model):
     75         """Time average of region-average time-series."""
---> 76         ts_ = self.ts(data, model)
     77         if 'year' not in ts_.coords:
     78             return ts_

/home/Spencer.Hill/py/aospy/aospy/region.pyc in ts(self, data, model)
     67         dims = ['lat', 'lon']
     68         coords = [data_masked.coords[c] for c in dims]
---> 69         sfc_area = data.sfc_area
     70         land_mask = data.land_mask
     71         weights = (self.mask_var(sfc_area)*land_mask).sum('lat').sum('lon')

/home/s1h/anaconda/lib/python2.7/site-packages/xray/core/common.pyc in __getattr__(self, name)
    135                     pass
    136         raise AttributeError("%r object has no attribute %r" %
--> 137                              (type(self).__name__, name))
    138
    139     def __dir__(self):

AttributeError: 'DataArray' object has no attribute 'sfc_area'

Support user-specified map/dict of directory structure.

The GFDL directory structure logic I use is specific to my own naming conventions, i.e. atmos_level_inst for instantaneous atmospheric data on model-native coordinates. However, not all people use these same names -- they can be named basically arbitrarily in the simulation's XML. Therefore, the logic for traversing a GFDL directory of model output should support user-supplied names.

The good thing is that, besides the names of the components, the structure is pretty much locked in, meaning that only the strings would have to be modified. The specification of these names could potentially vary from run to run within a model, and so it should be supported to be specified at the Run, Model, or Proj level. E.g. `component_names={'p.inst': 'atmos_inst', 'eta.inst': 'atmos_level_inst'}. This needs to be thought out further.

Improve unit conversion methods

C.f. #54. The existing Units class could potentially be harnessed here.

Enable datatype specifications for each input variable

If a variable is a function of multiple variables, the user should be able to designate input datatypes for each one individually. This has come up for me when computing budget residuals: I need 3 hourly instantaneous data to compute tendencies, but some of the other terms in the budget were only output as monthly.

Need to think through more carefully, but I think the CalcInterface class could be a starting point for this. Essentially, each Var inputted to a Calc would need to be given the misc. dtype attributes. Would also need to figure out how the user specifies these.

Supposing #3 was ultimately implemented, the dtype specifications of a var would then be copied down to those needed to compute it, if that var itself isn't found on disk.

Outputted netCDF files not readable by ncdump, ncopen command line tools

For example:

ncdump -h /work/s1h/aero_3agcm/am2/reyoi_cont/t_surf/t_surf.jja.std.from_monthly_ts.am2.reyoi_cont.1983-1984.nc
*** ncdump: ncopen failed on /work/s1h/aero_3agcm/am2/reyoi_cont/t_surf/t_surf.jja.std.from_monthly_ts.am2.reyoi_cont.1983-1984.nc

The official UCAR documentation is hard to read (as in a couple minutes of googling didn't get me anywhere), so it's not clear to me what exactly our files are missing or doing wrong. Interesting that those command line tools don't work whereas netCDF4-python and xarray do.

I am labeling this a bug, because I assumed previously that this would work, and ultimately it should. We should add some tests for this; for starters we could just the example file I used here.

Enable pre-computation 'adjustment' step.

There are adjustments that have to be made to, e.g. horizontal wind in order to conserved budgets, that yield a mismatch in the time frequency of the adjusted data and the rest. For example, to compute the time-mean component of column mass flux divergence, I have to apply a column mass adjustment to the horizontal wind (which requires other variables besides u and v) at each high-frequency timestep, then compute the flux divergence, then re-compute it after first time-averaging the adjusted wind and other fields, then subtract off this time-mean from the full field. Compare this to the built-in time-mean logic in Calc: take monthly averages of everything before passing it to the function to be computed.

So there should be an adjustment step wherein this is done beforehand, and that way the data can be broadcast back to the desired time range. Right now, I have to create functions that are specific to time mean, full, or eddy data, and can't utilize the 'time-mean.av', 'eddy.av' etc. functionality within Calc.

Need to think through how to implement this. This is part of a larger overhaul in order to support chained calculations, e.g. #3.

Retain data loaded from disk for multiple Calcs in a CalcSuite, rather than reloading each time

Determines what the needed arrays and time subsets are for each computation, and then holds on to a copy whenever a subsequent Calc needs one identical to (or a subset of) a pre-loaded one
So would keep the loading functions within the Calc class, but would have the Calc check to see if the CalcSuite has the data first. OR that check is done within CalcSuite -- if it doesn't have the data already, it calls on the Calc's loading functions to get the data.

Currently, the netCDF data is loaded and freed with each Calc, even if all of them require the same data, which slows things down a lot, especially for larger computations.

Plot profiles of model-native vertical coordinate data to correct levels, not reference levels

Currently, plotting a profile of data that is defined on model-native vertical coordinates results in the reference pfull (or phalf) array being used for the pressure value of each level. Unless the region of interest has a surface pressure that matches that used for the reference array, namely 1013.25 hPa, these pressure values are incorrect for the data being plotted. For example, I just created some profiles for a region with a surface pressure of 960 hPa, but its lowest level is (incorrectly) plotted to be at 1013.25 hPa.

The plotting routines should therefore use the actual pressure values for that data, which requires knowledge of the surface pressure.

Different vertical output types should be handled within same `Calc` object

Currently, a new Calc is created for each one, and thus the data is loaded and released unnecessarily the second time.
Both require the exact same data and thus its inefficient to have them as separate Calcs.
In contrast, 'sigma' and 'pressure' dtype_in_vert should be separate.
So it boils down to making dtype_out_vert behave like dtype_out_time does rather than the rest.
This could be addressed during implementation of CalcSuite class, since in that case CalcSuite will load and hold all the necessary data.

Issue warning when time coord varies among arrays used in a calculation

On multiple occasions I have been stung by, within a calculation function in aospy-obj-lib, accidentally not applying the same time transform to all of the variables. For example, for 3 hourly data that I average monthly within the function, I accidentally forget to apply that average to one variable only.

This bug can be hard to track down, because with xarray's powerful broadcasting, this usually doesn't cause the function to crash -- instead it computes and silently gives the wrong answer (from a physical standpoint).

This would be easier to catch if a warning was issued when there was a mismatch in the time coord between arrays. But how to raise this seems very tricky -- when/by what is the check performed? If a Var's function calls another function, it could be done at that point (maybe a decorator on the interior function).

Need to think more about, but getting this to eventually work somehow would be really handy.

Add testing data to aospy

C.f. discussion in #61. We should include some test data within aospy; xarray does so -- I imagine its actually a pretty standard practice. It's not good that our tests are tied to data on GFDL's archive, in terms of eventual growth (not to mention not being able to run them now at times when we're neither there nor ssh'ed)

The catch is that our files are large -- I think we would need to coarsen some model output first or even just create synthetic data...even a few MB is too large IMO.

Inherit an object's grid info if passed it as an argument during init

C.f. the comments in #14. This should be permitted at the Run, Model, or Proj level.

Create testing suite of cases for main computation loop

Unit tests are badly needed pretty much everywhere. But in particular, a suite of test-cases should be implemented for the main Calc.compute that span all of the possible permutations of input and ouput specifications. Reason being: frequently when modifying the Calc class, out of laziness I just do an ad-hoc test using a single permutation of the myriad data input/output options...and then when I need to use a different combination, I come to find that the changes I implemented broke the code for this particular case.

A few in particular that come to mind (additions are welcome; I will update this as they come up):

Input

Output

Vertically integrate and don't, attempted on both vertically defined and not vertically defined
Functions that use xray resampling on time coordinate (e.g. time tendencies).
ts, av, and std on lat-lon data
ts, av, and std on regional averages

Use logging builtin package to handle diagnostic info, stderr, etc

The Python 2.3+ standard libraries feature the logging package. This should be used to handle all logging-like features, rather than the hackish methods I have implemented to date in aospy that just print things to the terminal.

(This is part of a larger effort to better utilize the Python standard library.)

Resolving problems stemming from xray -> xarray transition

A few problems for aospy have come up relating to the xray to xarray transition.

Can't rename a dimension to it's existing name: pydata/xarray#724
The new xarray data model causes our aospy.utils.replace_coord function to break: pydata/xarray#725

Both of them have workarounds that I've implemented in 38d2bcb. The workaround for the latter is especially untenable in the long run, because the metadata for phalf gets kept for a pfull array. So I'll leave this open to remind us to resolve them once they get fixed on the xarray side (or if we keep up with something smarter in the meantime).

Make Constant a subclass of Var

A constant, after all, is just a particular kind of data variable. This would also enable the Calc._get_input_data method to be cleaned up (along with the switch from string labels to Var objects for p and dp in #14 ). The existing Constant objects defined in constants.py will also need the Var init arguments added: def_time, def_vert, etc.

Var.vars should accept Vars that are themselves functions of multiple Vars

Currently, if Var is passed a value for func when instantiated, it then populates the vars attribute with the Var objects that are passed into the given function. But this is only supported for one level: the Var objects passed to the function must be stand-alone, i.e. don't require a function of their own.

The result is the unnecessary proliferation of functions. For example, my flux divergence calculations currently have MSE, DSE, c_pT, gz, and general versions. Each one takes as arguments all the arguments required for the pre-calculation (e.g. temp, hght, and sphum for MSE), computes it by a call to that function, then calls the general function with that output as the argument.

Possible solution: when a Calc object encounters a Var with nested functions such as this, it should, from bottom up, create Calc objects that returns the full timeseries of the specified function. At the next level up, i.e. which is taking in this computed data as one of its Var objects, the code that loads data from netCDF will have to be bypassed, since you already have the timeseries in hand that you want.

Potentially adopt Unidata Common Data Model specifications

xarray notes in the front page of their docs that they adhere to the Unidata's Common Data Model (CDM). I'm not familiar with CDM's precise specifications (haven't read the page I linked to yet). But I'm wondering if this is something aospy should ultimately strive for, for a few reasons:

Given our dependence on xarray, it's likely that we're not far from CDM compliance already.
It would add in people's confidence in using aospy if it is known to follow this standard.
It would be in the spirit of harnessing other open source/community efforts.

This is to start a discussion. If we decide to pursue this, we can then get into details of implementation. First step (for me at least) is becoming more familiar with the CDM specs.

Use xarray.Dataset as way of saving vector quantities

At least before the migration to xray, aospy required that each user function return a scalar field, e.g. for horizontal wind one would compute/save the u and v components separately. However, the xray.Dataset object can nicely contain multiple variables, such that both components could be saved in a single file.

This might largely work out of the box, because most methods when applied to a Dataset are simply applied to all of the variables contained therein. I.e. the subsequent time averaging, vertical/regional averaging, etc. might not even require special logic for this to work.

But c.f. #14 , if we were to start saving other grid data that gets represented as variables (e.g. land_mask, sfc_area, ak, or bk), then this would become trickier. Perhaps a solution in that case would be to add the grid variables before passing to the user function, and then strip them off from the output (although the regional calculations would make use of them...so maybe would pass them as a separate Dataset to the regional calculations)

Suppress local timezone representation when printing times in aospy

>>> ds=xray.open_dataset('/archive/Spencer.Hill/am2/am2clim_hurrell/gfdl.ncrc2-default-prod/pp/atmos_level_inst/ts/3hr/1yr/atmos_level_inst.1982010100-1982123123.divg.nc')

>>> print(ds.time[0])
<xray.DataArray 'time' ()>
numpy.datetime64('1981-12-31T22:00:00.000000000-0500')
Coordinates:
    time     datetime64[ns] 1982-01-01T03:00:00
...

Notice that the time array values and the coordinate values differ by 5 hours. If you inspect the netCDF file directly via, e.g. ncdump -c, the time values start at 0.125, with units "days since 1982-01-01 00:00:00", i.e. 1982-01-01T03:00, consistent with the time coordinate value. This comes from the timezone offset that's being applied -- notice the '-0500' at the end of the array values. According to the numpy.datetime documentation, the default is to put things in terms of the local timezone, which for the U.S. Eastern Daylight Time (where I am) is -5 hours from UTC ("Zulu") time.

To me it's pretty confusing to have this timezone applied. I can't see any case where an aospy user would want their dates expressed in terms of a local timezone. Can it be suppressed? I found this Stack Overflow question, which seems relevant.

Ultimately, this doesn't affect the computations -- it only affects how the values are printed. But it caused me a lot of confusion and likely would do so for others, and so should be addressed.

Create `DataType` class and sub-classes.

This class would be sub-classed for different dimensions: TimeDataType, VertDataType, RegionDataType.

TimeDataType :: E.g. 'ts', 'inst', 'av' that we currently use.
VertDataType :: E.g. 'vert_int', False that we currently use.
RegionDataType :: E.g. 'reg' that we currently use.

So the strings and other primitives we currently used would be replaced by objects from these classes. At least initially, they need not be anything other than a container for that primitive. But re-factoring in terms of classes will make adding any future logic or extensions much easier to do.

Ideally, when #45 is implemented it would be in terms of these classes (i.e. this would be done before then).

Calc not correcting GFDL instantaneous data time for p or dp

Calc._get_input_data never calls the Calc._correct_gfdl_inst_time method on p or dp data, because the return statement within the elif var.name in ('p', 'dp'): block prevents those variable's data from ever reaching that part of the method.

phalf_from_pfull should interpolate to TOA and surface values if none specified

The aospy.utils.phalf_from_pfull function currently requires that the top-of-atmosphere and surface values be supplied as kwargs. It should also support extrapolation, since many variables don't have a priori constraints on their values at these upper and lower limits.

Land mask not being applied to regional calculations

For example, global, global-land, and global-ocean regions ('globe', 'land', and 'ocean' in my obj-lib) are all identical. Not sure for how long this has been the case.

(Good example of why not having automated tests causes problems.)

Create 'templates' directory w/ example project declaration, etc.

Would go in the top level directory, i.e. alongside docs/, aospy/, setup.py, etc.

So that user has an example/template for how to create a Proj object. Part of larger effort to remove user-specific things from the codebase itself.

Could also potentially include templates for creating Region objects, Calc objects, and Var objects, assuming those objects also get moved outside of the main codebase.

Bug: to_radians applied too late in model.py grid_sfc_area?

I hadn't pulled from the develop branch in a while, but I just did this morning. In doing some calculations involving surface area I noticed some things were off. In printing a sum of the surface area returned I got a value of:

<xarray.DataArray 'sfc_area' ()>
array(2.9224513997287056e+16)

This is too large by a factor of 180.0 / pi, hinting that it stems from not converting from degrees to radians.

I think the issue stems from line 194 in model.py:

dlon = to_radians(cls.diff_bounds(lon_bounds, lon))

In my case, but I think in most cases, the width of gridboxes in longitude in degrees is less than 4 pi. This causes utils.to_radians to pass the widths back unchanged. I think a solution here would be to just apply to_radians to lon_bounds (instead of after diff_bounds):

dlon = cls.diff_bounds(to_radians(lon_bounds), lon)

Luckily this shouldn't change any results from area averaging (since it's off by a constant factor), but important to get right nonetheless.

Get continuous integration tool(s) properly configured

I'm pretty sure that the Travis-CI tool is improperly configured to the point of being meaningless. It certainly doesn't run our test suite -- otherwise it would report the failures related to #50.

This is likely straightforward enough to fix.

Create temporary end-to-end suite of test cases

In lieu of a robust set of unit tests (see #22), which will take time to develop, we should have a set of aospy objects (and associated computations) that span the current horizon of the package's uses. Ideally we'd be able to run a script that would attempt to preform all the specified computations on the set of objects; if it ran successfully we'd at least have some confidence that it was doing what we wanted (or we'd be able to see what tripped it up).

The details are still a bit up in the air (we'd have to figure out how to get things to run with one click), but for now I suggest we create a Project that contains Runs from some of the Models we have worked with:

We'd then have to come up with a set of computations to do on each of them. To start out I think these can just be computations that aospy can do without a call to an external aospy_user/calcs/ function (although I guess there is no reason we couldn't have some dummy functions within this pseudo-aospy_user environment). (This was addressed in #61, which now does allow for calls to external calculation functions located within test_objs/calcs.py). The most basic computations would be (feel free to add more as they come up):

Annual mean, and time series
Seasonal mean, and time series
Monthly mean, and time series
Vertical integral over interpolated pressure levels
Vertical integral over sigma pressure surfaces
Regional average, and time series (with no land mask)
Regional average, and time series (with land mask)
Regional average, and time series (with ocean mask)

It would also be great if we could have some reference computations to compare to for these calculations, but that too might be a goal for down the road. Initially it can just answer the question "does it run or not?"

Remove obsolete Constant and Units classes

In particular, multiplying a Constant object with an xray object results in a numpy array, rather than an xray object as desired. This is just one example of multiple problems with the operator overloads for the Constant class...I didn't really know what I was doing when I wrote them! So all of them need to be inspected, likely re-written, and tested.

As a workaround until then, one can just use the value attribute, e.g. c_p.value rather than just c_p.

Data saving to /archive not working

By chance, I just noticed that saving to archive has not been happening (for my runs at least) for quite some time -- none of the data.tar files of mine on /archive have been updated since this summer. So the source of this probably originates from around then (the latest date I have is August 27).

@spencerkclark, is this the case for you too? It is possible I modified something within my aospy-obj-lib, and aospy itself is working fine.

I/O to scratch is working fine (which is why I didn't notice this for so long!).

Surface area computation wrong when bounds arrays are 2-D

I.e. for AM3, HiRAM, or c48-HiRAM.

Values at poles are zero, and likely they are incorrect in other places also.

Part of larger issue of the 2-D lat/lon bounds causing lots of problems: see #39 (you can see the 0.0 surface area values there too).

Mask unphysical values based on gradients, not variable-specific valid range

Post-processing of model data onto pressure coordinates over topography can lead to unphysical values in the resulting arrays that are not masked as being underground. Currently the Var class has a method implemented, mask_unphysical, that uses the attribute valid_range to mask those values outside this range.

This is an imperfect solution. Considering the entire globe and the depth of the atmosphere, some variables span such a wide range of values that the unphysical values are actually within this range. In such a scenario, masking the unphysical values will also inadvertently mask physically valid values in some locations.

Regardless of the absolute values, the unphysical values show up essentially as discontinuities in the horizontal gradient of the field. For example, the difference between values over northern Africa in zonally adjacent cells at 925 hPa of moist static energy are all near 1% or lower, except for at one of these unphysical points where the difference is 8%. So the idea would be to exploit this characteristic as the masking mechanism. One caveat is that some fields inherently have very sharp gradients, such that a method such as this could be ill-posed.

Also, the unphysical values due to topography only occur in the levels to which topography extends (which is controlled by Tibet & the Himalayas), which the current implementation doesn't take into account but could. They do not necessarily occur directly next to gridpoints that are masked for topography, otherwise it would be much easier to find them.

Enable grid attributes to be defined at the Run level

C.f. this comment in #14 and the comments in #23. This would enable runs that should, conceptually, belong to the same Model object, but that have different grids for whatever reason (different postprocessing choice, change in model code that slightly alters coordinates), to indeed have the same parent Model.

Math string methods/class for creating strings of variable names

Already I have populated ~10% of my Var objects with a math_str attribute, which consists of a raw string with LaTeX formatting of that variable's mathematical expression. It remains to fully implement this:

Methods or keywords (e.g. var_math_str) that can be passed to aospy.plotting classes as kwargs, which then ultimately get replaced with the desired string.
Add overlines and eddy ticks as appropriate given time averaging of input data & of output.
Add curly braces automatically when data loaded is vertically averaged.

Solution needed for Model-dependent Calc functions

An issue I've encountered, particularly for more involved calculations which involve many variables, is that the way the calculation needs to be carried out varies depending on the model used. For a concrete example take computing the net column heating due to radiative, sensible, and latent heat fluxes. In a comprehensive GCM (like AM2) there is a term due to reflected shortwave radiation (swup_toa); this does not exist in the gray-atmosphere idealized moist model. In addition there are differences in how to handle the evaporative heat flux from the surface in each Model. Currently I address this by creating two separate calc functions and two separate variables linked with each calc function. This is rather inefficient, and I end up with two differently named variables representing the same physical quantity.

Another somewhat related situation that I encounter is that in some cases a variable is already computed in the model, while in others a variable needs to be computed from other variables. This is especially tricky to deal with when that variable is involved in another function; again I need to create separate functions and variables for each Model.

I'm not aware of any current (or numpy-legacy) infrastructure that is (was) in place that works to address this issue. I don't have a particularly good solution to either of these problems in mind at the moment -- I'll add one here if I think of one, but I just wanted to raise this issue while we are thinking about the next refactor of the object library.

Refactor main script

Currently there are separate classes to parse the arguments in the main routine for calculations and the main routine for plotting, and lots of methods embedded within plotting classes. The parsing functions within the plotting classes should be refactored out -- the plotting classes shouldn't be responsible for parsing user input, only using the results to generate the desired plots.

There should be a single class if possible, or a single parent class that the others inherit from.

Although it needs refactoring as described above, the functionality of the parsing for plotting is more advanced than for the main calculation routine. The latter just permutes all of the specified inputs, attempting to perform a calculation for every possible combination. The former permits passing arguments to specific panels or plots, or across all elements, among other possibilities (although the syntax is pretty confusing).

Implement xray objects in place of numpy arrays

Manual manipulation of axes in many different places has become a nuisance.
Need to standardize array shapes and handle them all the same.
Trick is making sure there's no ambiguity based on shapes of what dimensions remain, etc.
All functions should expect arrays with dimensions [time, vert, lat, lon]
But they should also be able to handle lower dimension arrays if physically valid. For example, moist static energy could be computed for any input data dimensions; it need not be time, vert, lat, or lon defined.
Continuing with this example, based on shape alone, it's ambiguous which dimensions are included.
Therefore need to encode in the dimension information.
So create a class, =AosArray=, that is just a numpy or numpy.ma array but with an extra attribute specifying the name of each dimension.
This is essentially copying the functionality of netCDF, but I think switching to full netCDF objects is overkill (why?).
The numpy.expand_dims function might be very useful for this: http://docs.scipy.org/doc/numpy/reference/generated/numpy.expand_dims.html

Enable outputting data at each timestep

Even for high frequency data, the 'ts' output option is still an average over the specified sub-annual range. It would be nice to be able to save the data at every timestep on which it was created. This is a necessary condition for using aospy-generated data as input for other aospy functions, c.f. #3. The logic should be relatively straightforward to implement: compute the function at each timestep as usual, and then just skip all further time reductions.

Perhaps not right away, but ultimately there would need to be the ability to specify (or automatically determine) time chunks for splitting the output into different files. For high frequency data, e.g. 3hr or 6hr, even a 1-2 yr file is on the order of 1 GB. So without some chunking, the files will become huge for longer time durations.

Refactor: break Calc class up into smaller classes

I have been thinking casually about this for a long time but was spurred by #32 to formalize it. Even then, its still a work in progress.

Ultimately all of the filesystem IO should be factored out of Calc and into its own specialized classes. Here's how I envision the overall flow, from user input to exporting the results:

User desired input is parsed by some combination of classes (beyond the scope of this issue; see #4) to generate a CalcSuite #9 ).
CalcSuite generates CalcDataGetter object, one per Calc.
Each CalcDataGetter generates DataLoader objects, one for each needed piece of data saved on disk.
The DataLoader objects load their data from disk and creates CalcData objects. To start, CalcData can basically just be a wrapper around xray.DataArray, with xray_data and numpy_data as attributes, holding the xray.DataArray and np.ndarray representations of the data, respectively.
CalcInterface contains only the xray.DataArrays corresponding to the needed data and the minimal amount of additional metadata needed to execute the function.
Time/space reductions are handled by a separate class, Reduction, that accepts a gridded timeseries and an object (its precise form TBD) that specifies all of the reductions to be performed.
Reduction generates a child class, Reducer, one per reduction to be performed, to execute them. Some logic is necessary, or perhaps child classes RegReducer and GriddedReducer,
Each Reducergenerates a AOSData object comprising its xray.DataArray and any other necessary metadata.
Not sure which class should be responsible for this, but one of them then generates DataExporter objects that are responsible for writing the AOSData object to disk/database/whatever.

Use coordinates directly from xarray.DataArrays, not from parent Model

Currently, each Model object is instantiated with paths to various netCDF files which are used to supply basic information about its grid: latitude, longitude, etc. Some physical calculations require this grid information in addition to physical variables themselves. However, the values of these grid properties are not necessarily unique to a Model: simulations run in the same model may be output to different grids (horizontal or vertical), or changes in software may make, e.g. the latitude values off on the 5th digit, causing xray to throw an error (the last example comes from @spencerkclark ).

The netCDF files from which data is loaded all contain the coordinate data locally already, and so do the xray.Dataset and/or DataArray objects derived from those files. So there is no reason to always pull from the Model's grid attributes -- they should be pulled from the local variable itself.

As a starting point, Calc should just grab the needed coordinate/grid values from the 1st variable that it loads in that has those values. Pressure values and vertical pressure thicknesses will likely need to be handled as their own unique case, which is what's done already anyways.

For 'gfdl_repo' and 'one_dir' data, missing data silently ignored

If a date range is specified but the data is only found for a subset of that range, computations are performed on the available data without any indication of the incorrect range, and when saved to disk the label incorrectly indicates the full specified range.

For example, If I ask for 1979-2008 averages on this precipitation data: /archive/pcmdi/repo/CMIP5/output/NSF-DOE-NCAR/CESM1-CAM5/amip/mon/atmos/Amon/r1i1p1/v20120612/pr/pr_Amon_CESM1-CAM5_amip_r1i1p1_197901-200512.nc, the output will be saved as 1979-2008 despite the data stopping at 2005.

Should be straightforward to fix -- just requires adding some checks and raising an exception.

Create base class that `Proj`, `Model`, and `Run` all inherit from

The data structures Proj, Model, and Run don't obey an "is-a" conceptual relationship relative to one another (a model is not a type of project, and a run is not a type of model), and thus shouldn't inherit from each other. It's more of a "has-a" relationship: a project has one or more model; a model has one or more runs).

Nevertheless, each of them can be conceptualized as being a particular kind of container of data describing climate model/obs data saved somewhere on disk. So there should be a parent class, something like InputDataSpecs, that they each inherit from. Since we are wanting all specifications about the data to be made at any of the levels, c.f. #28 #27 and #26, this could be implemented all within this parent class, with any level-specific variations being done through overriding/adding methods within that particular child level. This may ultimately render the Proj, etc. classes with very little unique to them, but they should still be retained to facilitate easy organization/calculation specifications.

Var is a little trickier, because in addition to specifying where the data can be found (although its not really used in this way in my workflow at least), it also specifies physical aspects of the quantity being represented. So this requires more thinking -- one option would be to have two classes, one for each of these functions, with the former also inheriting from this parent class.