
pytesmo's Issues

bias calculation

Hey! Bias is currently computed as np.mean(observations) - np.mean(predictions), but it should be the other way around (np.mean(predictions) - np.mean(observations)), right? @iteubner also mentioned this some time ago.
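For illustration, the two sign conventions differ only in which mean is subtracted (toy numbers, not taken from pytesmo):

```python
import numpy as np

predictions = np.array([1.0, 2.0, 3.0])
observations = np.array([1.5, 2.5, 3.5])

bias_current = np.mean(observations) - np.mean(predictions)   # obs - pred
bias_proposed = np.mean(predictions) - np.mean(observations)  # pred - obs
```

With these numbers the current formula gives +0.5, the proposed one -0.5; only the sign flips.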

Revisit temporal collocation and add collocation examples to documentation

Temporal collocation should be revisited (e.g. performance, features) and more tests should be added. In particular, we should document how exactly it works and how duplicates are treated. A plot summarizing temporal collocation information/results would already be a good step forward.

drop_duplicates=True - users need to understand the impact of keeping duplicates: from a scatterplot point of view, several y-values share the same x-value, which can affect the overall metric positively or negatively depending on the x-position and the spread of the y-values (e.g. in a linear regression). Furthermore, the duplicates change depending on which data set is chosen as the reference: matching a 12-hour sampled data set against a 3-hour sampled one yields different results depending on the selected reference.

Special strategy for matching soil moisture data sets? I haven't investigated the impact so far, but for soil moisture it probably makes sense to use a dedicated strategy during temporal matching. E.g. matching the reference data set with past observations may cause a mismatch, because a rain event could have happened between the two observations and only the second data set captured it. Unfortunately, the same can happen if the future observation is used for matching. I think this is not easy to solve and some tests are necessary to decide on the best matching strategy: even if two measurements are close in time (e.g. 6 hours apart), you cannot rule out that a rain event occurred in between, or that both should have seen it but the soil moisture retrieval is wrong/bad.

In summary, time matters and we should make users aware of that during the computation of metrics.
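As a sketch of the duplicate problem (not pytesmo's actual implementation), nearest-neighbour matching with pandas shows how a coarser series gets duplicated when matched against a finer one:

```python
import pandas as pd

# 3-hourly candidate matched against a 12-hourly reference (made-up data):
cand = pd.DataFrame({"sm_cand": range(8)},
                    index=pd.date_range("2020-01-01", periods=8,
                                        freq=pd.Timedelta(hours=3)))
ref = pd.DataFrame({"sm_ref": [0.2, 0.3]},
                   index=pd.date_range("2020-01-01", periods=2,
                                       freq=pd.Timedelta(hours=12)))

# Each candidate timestamp gets the nearest reference value within 6 hours;
# several candidate values end up paired with the SAME reference value,
# and candidates with no reference within the window become NaN:
matched = pd.merge_asof(cand, ref, left_index=True, right_index=True,
                        direction="nearest", tolerance=pd.Timedelta("6h"))
```

Swapping which data set is the reference changes which values get duplicated, which is exactly why the choice of reference matters for the metrics.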

nan values for estimated error from approach 2 triple collocation/pytesmo

Hi,
I have 3 data sets and I ran both approaches on them:
Approach 1: pytesmo.scaling( ) followed by pytesmo.metrics.tcol_error( ), which returns 3 values (ex, ey, ez).
Approach 2: pytesmo.metrics.tcol_snr( ), which returns 9 values (the first 3: SNR, the second 3: (ex, ey, ez), the third 3: beta).

From approach 2, I received "nan" for some of the SNR and error values (60 "nan" days out of 950). I assume the estimated error is very low or negative in those cases. How can I cope with this problem?
Did anybody experience the same problem?
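The NaNs typically come from a negative estimated error variance: the covariance-based triple collocation estimate can go negative for noisy or short samples, and the square root is then undefined. A minimal sketch of the classical estimator (not pytesmo's implementation):

```python
import numpy as np

def tc_error_std(x, y, z):
    """Classical triple collocation error std. dev. of x (sketch).
    e_x^2 = var(x) - cov(x, y) * cov(x, z) / cov(y, z); if sampling
    noise makes this expression negative, the result is NaN."""
    c = np.cov(np.vstack((x, y, z)))
    ex2 = c[0, 0] - c[0, 1] * c[0, 2] / c[1, 2]
    return np.sqrt(ex2) if ex2 >= 0 else np.nan
```

So a NaN is not necessarily a bug: it can mean the error variance estimate was negative, i.e. the sample violated the triple collocation assumptions for that period.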

lin_cdf_match and 'x must be strictly increasing' on SciPy > 1.0

I've run into ISMN timeseries files that are turned into a series of NaNs by scaling (e.g. ARM/Anthony/ARM_ARM_Anthony_sm_0.025000_0.025000_SMP1-A_19780101_20180712.stm at lon: -98.0969, lat: 37.2134).

This seems to happen in pytesmo.scaling.gen_cdf_match (scaling.py, line 403) because SciPy's InterpolatedUnivariateSpline raises ValueError('x must be strictly increasing'). Pytesmo catches this exception and instead emits the warning "Too few percentiles for chosen k" (are those two things really equivalent?), then returns a column of NaNs.

The exception seems to have been introduced in SciPy 1.0 according to this SciPy ticket: scipy/scipy#8535

If I downgrade to SciPy 0.19.1, the scaling works (however, the metrics calculated later are still NaNs).

Is there a way to scale these timeseries anyway (and not get just NaNs)? Or are the timeseries just somehow corrupted?

The issue can be reproduced with SciPy 1.1.0 by running:

    import numpy as np
    from pytesmo import scaling

    def test_strictly_increasing():
        # a constant candidate series: all percentiles collapse to 0.5
        n = 1000
        x = np.ones(n) * 0.5
        y = np.ones(n)

        s = scaling.lin_cdf_match(y, x)
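The underlying cause can be shown with SciPy alone: a constant series collapses all percentiles to the same value, and InterpolatedUnivariateSpline rejects the non-increasing x on SciPy >= 1.0 (the exact message may vary between versions):

```python
import numpy as np
from scipy.interpolate import InterpolatedUnivariateSpline

x = np.ones(1000) * 0.5  # constant series
pctl = np.percentile(x, [0, 5, 10, 30, 50, 70, 90, 95, 100])
# every percentile equals 0.5 -> x is not strictly increasing
try:
    InterpolatedUnivariateSpline(pctl, pctl, k=1)
except ValueError as err:
    print(err)  # e.g. "x must be strictly increasing"
```

This suggests the time series are not corrupted so much as (near-)constant, which makes CDF matching ill-defined regardless of SciPy version.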

ModuleNotFoundError from mkl_fft-1.0.4

When testing pytesmo in a freshly created conda env, I get issues with mkl_fft-1.0.4 throwing ModuleNotFoundError during the tests.

The env is created with:

conda create --yes -n deptest -c conda-forge python=2.7 numpy scipy pandas=0.22 netCDF4 cython pytest pip matplotlib pyproj

and that automatically installs mkl_fft-1.0.4 on the machines I'm testing on.

The unit tests then fail with:

Traceback (most recent call last):
File "setup.py", line 66, in <module>
    setup_package()
File "setup.py", line 62, in setup_package
    use_pyscaffold=True)
File "/data/jenkins/workspace/Pytesmo_installation/pytesmo_install/pytesmo_install/pytesmoenv/lib/python2.7/site-packages/setuptools/__init__.py", line 131, in setup
    return distutils.core.setup(**attrs)
File "/data/jenkins/workspace/Pytesmo_installation/pytesmo_install/pytesmo_install/pytesmoenv/lib/python2.7/distutils/core.py", line 151, in setup
    dist.run_commands()
File "/data/jenkins/workspace/Pytesmo_installation/pytesmo_install/pytesmo_install/pytesmoenv/lib/python2.7/distutils/dist.py", line 953, in run_commands
    self.run_command(cmd)
File "/data/jenkins/workspace/Pytesmo_installation/pytesmo_install/pytesmo_install/pytesmoenv/lib/python2.7/distutils/dist.py", line 972, in run_command
    cmd_obj.run()
File "/data/jenkins/workspace/Pytesmo_installation/pytesmo_install/code/.eggs/PyScaffold-2.5.11-py2.7.egg/pyscaffold/pytest_runner.py", line 101, in run
    self.with_project_on_sys_path(self.run_tests)
File "/data/jenkins/workspace/Pytesmo_installation/pytesmo_install/pytesmo_install/pytesmoenv/lib/python2.7/site-packages/setuptools/command/test.py", line 126, in with_project_on_sys_path
    with self.project_on_sys_path():
File "/data/jenkins/workspace/Pytesmo_installation/pytesmo_install/pytesmo_install/pytesmoenv/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
File "/data/jenkins/workspace/Pytesmo_installation/pytesmo_install/pytesmo_install/pytesmoenv/lib/python2.7/site-packages/setuptools/command/test.py", line 154, in project_on_sys_path
    self.run_command('build_ext')
File "/data/jenkins/workspace/Pytesmo_installation/pytesmo_install/pytesmo_install/pytesmoenv/lib/python2.7/distutils/cmd.py", line 326, in run_command
    self.distribution.run_command(command)
File "/data/jenkins/workspace/Pytesmo_installation/pytesmo_install/pytesmo_install/pytesmoenv/lib/python2.7/distutils/dist.py", line 972, in run_command
    cmd_obj.run()
File "/data/jenkins/workspace/Pytesmo_installation/pytesmo_install/pytesmo_install/pytesmoenv/lib/python2.7/distutils/command/build_ext.py", line 340, in run
    self.build_extensions()
File "setup.py", line 41, in build_extensions
    numpy_incl = pkg_resources.resource_filename('numpy', 'core/include')
File "/data/jenkins/workspace/Pytesmo_installation/pytesmo_install/pytesmo_install/pytesmoenv/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1135, in resource_filename
    return get_provider(package_or_requirement).get_resource_filename(
File "/data/jenkins/workspace/Pytesmo_installation/pytesmo_install/pytesmo_install/pytesmoenv/lib/python2.7/site-packages/pkg_resources/__init__.py", line 351, in get_provider
    __import__(moduleOrReq)
File "/data/jenkins/workspace/Pytesmo_installation/pytesmo_install/pytesmo_install/pytesmoenv/lib/python2.7/site-packages/numpy/__init__.py", line 158, in <module>
    from . import fft
File "/data/jenkins/workspace/Pytesmo_installation/pytesmo_install/pytesmo_install/pytesmoenv/lib/python2.7/site-packages/numpy/fft/__init__.py", line 15, in <module>
    import mkl_fft._numpy_fft as _nfft
File "/data/jenkins/workspace/Pytesmo_installation/pytesmo_install/pytesmo_install/pytesmoenv/lib/python2.7/site-packages/mkl_fft/__init__.py", line 27, in <module>
    from ._pydfti import (fft, ifft, fft2, ifft2, fftn, ifftn, rfft, irfft,
File "mkl_fft/_pydfti.pyx", line 32, in init mkl_fft._pydfti
NameError: name 'ModuleNotFoundError' is not defined

However, if I install mkl_fft with pip (instead of conda), I get version 1.0.2 which doesn't cause this problem. I can also uninstall mkl_fft and run the tests without problems.

Yeah, I know that this problem doesn't seem to occur on Travis - I guess because mkl_fft isn't installed there (probably because the CPUs there don't identify as Intel?).

So there is a workaround, but it looks like compiling and testing pytesmo no longer works out of the box on Python 2.7?

Scaling with NaN

Hi! I'm using the linreg scaling for bias correction. Currently I get an error when the candidate and reference time series contain NaNs at different points in time, because the regression cannot be calculated if the two series don't match. But why can't we fit the model on the coinciding values (dropna before linreg) and then apply the correction to ALL values of the candidate (NaNs included)? The non-NaN candidate values would be scaled and no values would be dropped. I think this would be a better solution, but I guess there's also a reason why you implemented it differently?
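A sketch of the proposed behaviour (hypothetical helper, not the current pytesmo API): fit on coinciding non-NaN pairs, apply to everything:

```python
import numpy as np
import pandas as pd

def linreg_scale_keep_nan(candidate, reference):
    """Fit a linear model on coinciding non-NaN pairs only,
    then apply it to ALL candidate values (NaNs stay NaN)."""
    both = pd.concat([candidate, reference], axis=1,
                     keys=["cand", "ref"]).dropna()
    slope, intercept = np.polyfit(both["cand"], both["ref"], deg=1)
    return candidate * slope + intercept
```

Values that are NaN in the candidate simply stay NaN after scaling, while every valid candidate value gets corrected.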

RuntimeError

The data manager of the validation framework catches only IOError, but e.g. netCDF4 raises a RuntimeError if a file doesn't exist, see here.

The solution would be to also catch RuntimeError, e.g.:

    except (RuntimeError, IOError) as err:
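A minimal sketch of the broadened handling (the reader function and names are hypothetical):

```python
import warnings

def safe_read(read_func, *args, **kwargs):
    """Return the reader's result, or None if the read fails.
    netCDF4 raises RuntimeError rather than IOError for missing files,
    so both exception types are caught."""
    try:
        return read_func(*args, **kwargs)
    except (RuntimeError, IOError) as err:
        warnings.warn("could not read data: {}".format(err))
        return None
```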

data_manager drops na values

The data_manager drops NaN values during reading.
This can be dangerous if the dataframe contains several columns of which only one has a NaN value. So we should either change this to dropna(how='all') or remove it completely and let the data reader or the data preparation step take care of it.

@aplocon Is there any other reason you can think of why it would make sense to drop the nan values in https://github.com/TUW-GEO/pytesmo/blob/develop/pytesmo/validation_framework/data_manager.py#L159 or https://github.com/TUW-GEO/pytesmo/blob/develop/pytesmo/validation_framework/data_manager.py#L209
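The difference is easy to demonstrate on a toy dataframe:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"sm": [0.1, np.nan, 0.3],
                   "flag": [0.0, 1.0, np.nan]})

rows_any = len(df.dropna())            # drops a row if ANY column is NaN
rows_all = len(df.dropna(how="all"))   # drops a row only if ALL columns are NaN
```

With the current behaviour only one of the three rows survives, even though every row carries valid data in at least one column; dropna(how='all') keeps all three.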

numpy issue with timedate.doy

After an update of numpy, timedate.doy didn't work any more because of the addition of a boolean array to an integer array (Line 306)... I could resolve it with a simple type conversion:

    day_of_year = (day_of_year - nonleap_years.astype('int')
                   + np.logical_and(day_of_year < 60, nonleap_years).astype('int'))

Not sure if this was really a numpy issue or something else, but this made it work again.
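A toy illustration of the corrected expression (made-up inputs; days at or after the missing Feb 29 of a non-leap year shift back by one, earlier days stay unchanged):

```python
import numpy as np

doy = np.array([59, 60, 61])             # day of year on a leap-year calendar
nonleap = np.array([True, True, True])   # observations falling in non-leap years

# Casting the boolean masks to int before the arithmetic avoids the
# boolean add/subtract restrictions of newer NumPy versions:
fixed = (doy - nonleap.astype('int')
         + np.logical_and(doy < 60, nonleap).astype('int'))
```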

ISMN reader produces different indices for same stations on different systems

Like it says in the title: The ISMN reader (in particular get_dataset_ids) produces different indices for the same stations on different systems. This makes it harder to check errors occurring in a production system on your developer machine.

I think this comes from the metadata collector (pytesmo.io.ismn.metadata_collector.collect_from_folder), line 57, which uses os.walk. os.walk doesn't guarantee an order, but we could sort the folders and files lists alphabetically. This would take care of the problem, barring locale issues (different sort orders under different locales).
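A sketch of the sorted traversal (the helper name is made up; sorting the dirs list in place also fixes the order in which os.walk recurses):

```python
import os
import tempfile

def walk_sorted(path):
    """os.walk with deterministic alphabetical ordering."""
    for root, dirs, files in os.walk(path):
        dirs.sort()    # in-place sort controls the recursion order too
        files.sort()
        yield root, dirs, files

# Demo on a throwaway directory tree:
base = tempfile.mkdtemp()
for name in ("b_station", "a_station"):
    os.mkdir(os.path.join(base, name))
order = next(walk_sorted(base))[1]
```

Regardless of the order the OS returns, the stations are now always visited alphabetically, so the generated dataset ids stay stable across systems.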

Refactor and improve scaling module

The current implementations of the scaling functions need some attention and more comments.

From an email by y.zeng@...

As for the usage of scaling module, I found some part of the code not easy to digest. Could you help me with some explanation?

For the “scaling.lin_cdf_match”, I found lines as below:
    percentiles = [0, 5, 10, 30, 50, 70, 90, 95, 100]

    in_data_pctl = np.array(np.percentile(in_data, percentiles))
    scale_to_pctl = np.array(np.percentile(scale_to, percentiles))

    uniq_ind = np.unique(in_data_pctl, return_index=True)[1]
    in_data_pctl = in_data_pctl[uniq_ind]
    scale_to_pctl = scale_to_pctl[uniq_ind]

    uniq_ind = np.unique(scale_to_pctl, return_index=True)[1]
    in_data_pctl = in_data_pctl[uniq_ind]
    scale_to_pctl = scale_to_pctl[uniq_ind]

    f = sc_int.interp1d(in_data_pctl, scale_to_pctl)

    ff = f(in_data)

I am wondering what the function of the lines in the middle (the two np.unique blocks) is. It seems to me that directly using "in_data_pctl" and "scale_to_pctl" in "sc_int.interp1d" already gives reasonable results.

Could you explain the intention of this middle part of the code?

We should also implement the separate estimation and application of CDF scaling parameters.
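A sketch of what the split could look like (hypothetical function names, using scipy's interp1d as in the current code; the np.unique step drops duplicate percentiles so the interpolator's x stays strictly increasing, which answers the question above):

```python
import numpy as np
from scipy import interpolate

def cdf_match_params(in_data, scale_to,
                     percentiles=(0, 5, 10, 30, 50, 70, 90, 95, 100)):
    """Estimate matched percentile pairs; duplicate source percentiles
    (which occur for near-constant data) are removed."""
    src = np.percentile(in_data, percentiles)
    dst = np.percentile(scale_to, percentiles)
    src, idx = np.unique(src, return_index=True)
    return src, dst[idx]

def cdf_match_apply(data, src_pctl, dst_pctl):
    """Apply previously estimated scaling parameters to (new) data."""
    f = interpolate.interp1d(src_pctl, dst_pctl, fill_value="extrapolate")
    return f(data)
```

Separating the two steps would let users estimate the parameters once (e.g. on a calibration period) and apply them to other periods or data sets.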
