
pytesmo's Issues

bias calculation

Hey! Bias is currently computed as np.mean(observations) - np.mean(predictions), but it should be the other way around (np.mean(predictions) - np.mean(observations)), right? @iteubner also mentioned this some time ago.
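For illustration, the two sign conventions differ only in which mean is subtracted (toy numbers, not taken from pytesmo):

```python
import numpy as np

predictions = np.array([1.0, 2.0, 3.0])
observations = np.array([1.5, 2.5, 3.5])

bias_current = np.mean(observations) - np.mean(predictions)   # obs - pred
bias_proposed = np.mean(predictions) - np.mean(observations)  # pred - obs
```

With these numbers the current formula gives +0.5, the proposed one -0.5; only the sign flips.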

Revisit temporal collocation and add collocation examples to documentation

Temporal collocation should be revisited (e.g. performance, features) and more tests should be added. In particular, we should document how exactly it works and how duplicates are treated. A plot summarizing temporal collocation information/results would already be a good step forward.

drop_duplicates=True - users need to understand the impact of keeping duplicates: from a scatterplot point of view, several y-values share the same x-value, which can affect the overall metric positively or negatively depending on the x-position and the spread of the y-values (e.g. in a linear regression). Furthermore, the duplicates change depending on which data set is chosen as the reference: matching a 12-hour sampled data set against a 3-hour sampled one yields different results depending on the selected reference.

Special strategy for matching soil moisture data sets? I haven't investigated the impact so far, but for soil moisture it probably makes sense to use a dedicated strategy during temporal matching. E.g. matching the reference data set with past observations may cause a mismatch, because a rain event could have happened between the two observations and only the second data set captured it. Unfortunately, the same can happen if the future observation is used for matching. I think this is not easy to solve and some tests are necessary to decide on the best matching strategy: even if two measurements are close in time (e.g. 6 hours apart), you cannot rule out that a rain event occurred in between, or that both should have seen it but the soil moisture retrieval is wrong/bad.

In summary, time matters and we should make users aware of that during the computation of metrics.
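As a sketch of the duplicate problem (not pytesmo's actual implementation), nearest-neighbour matching with pandas shows how a coarser series gets duplicated when matched against a finer one:

```python
import pandas as pd

# 3-hourly candidate matched against a 12-hourly reference (made-up data):
cand = pd.DataFrame({"sm_cand": range(8)},
                    index=pd.date_range("2020-01-01", periods=8,
                                        freq=pd.Timedelta(hours=3)))
ref = pd.DataFrame({"sm_ref": [0.2, 0.3]},
                   index=pd.date_range("2020-01-01", periods=2,
                                       freq=pd.Timedelta(hours=12)))

# Each candidate timestamp gets the nearest reference value within 6 hours;
# several candidate values end up paired with the SAME reference value,
# and candidates with no reference within the window become NaN:
matched = pd.merge_asof(cand, ref, left_index=True, right_index=True,
                        direction="nearest", tolerance=pd.Timedelta("6h"))
```

Swapping which data set is the reference changes which values get duplicated, which is exactly why the choice of reference matters for the metrics.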

nan values for estimated error from approach 2 triple collocation/pytesmo

Hi,
I have 3 data sets and I ran both approaches on them:
Approach 1: pytesmo.scaling( ) followed by pytesmo.metrics.tcol_error( ), which returns 3 values (ex, ey, ez).
Approach 2: pytesmo.metrics.tcol_snr( ), which returns 9 values (the first 3: SNR, the second 3: (ex, ey, ez), the third 3: beta).

From approach 2, I received "nan" for some of the SNR and error values (60 "nan" days out of 950). I assume the estimated error is very low or negative in those cases. How can I cope with this problem?
Did anybody experience the same problem?
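The NaNs typically come from a negative estimated error variance: the covariance-based triple collocation estimate can go negative for noisy or short samples, and the square root is then undefined. A minimal sketch of the classical estimator (not pytesmo's implementation):

```python
import numpy as np

def tc_error_std(x, y, z):
    """Classical triple collocation error std. dev. of x (sketch).
    e_x^2 = var(x) - cov(x, y) * cov(x, z) / cov(y, z); if sampling
    noise makes this expression negative, the result is NaN."""
    c = np.cov(np.vstack((x, y, z)))
    ex2 = c[0, 0] - c[0, 1] * c[0, 2] / c[1, 2]
    return np.sqrt(ex2) if ex2 >= 0 else np.nan
```

So a NaN is not necessarily a bug: it can mean the error variance estimate was negative, i.e. the sample violated the triple collocation assumptions for that period.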

lin_cdf_match and 'x must be strictly increasing' on SciPy > 1.0

I've run into ISMN timeseries files that are turned into a series of NaNs by scaling (e.g. ARM/Anthony/ARM_ARM_Anthony_sm_0.025000_0.025000_SMP1-A_19780101_20180712.stm at lon: -98.0969, lat: 37.2134).

This seems to happen in pytesmo.scaling.gen_cdf_match (scaling.py, line 403) because SciPy's InterpolatedUnivariateSpline raises ValueError('x must be strictly increasing'). Pytesmo catches this exception and instead emits the warning "Too few percentiles for chosen k" (are those two things really equivalent?), then returns a column of NaNs.

The exception seems to have been introduced in SciPy 1.0 according to this SciPy ticket: scipy/scipy#8535

If I downgrade to SciPy 0.19.1, the scaling works (however, the metrics calculated later are still NaNs).

Is there a way to scale these timeseries anyway (and not get just NaNs)? Or are the timeseries just somehow corrupted?

The issue can be reproduced with SciPy 1.1.0 by running:

    import numpy as np
    from pytesmo import scaling

    def test_strictly_increasing():
        # a constant candidate series: all percentiles collapse to 0.5
        n = 1000
        x = np.ones(n) * 0.5
        y = np.ones(n)

        s = scaling.lin_cdf_match(y, x)
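The underlying cause can be shown with SciPy alone: a constant series collapses all percentiles to the same value, and InterpolatedUnivariateSpline rejects the non-increasing x on SciPy >= 1.0 (the exact message may vary between versions):

```python
import numpy as np
from scipy.interpolate import InterpolatedUnivariateSpline

x = np.ones(1000) * 0.5  # constant series
pctl = np.percentile(x, [0, 5, 10, 30, 50, 70, 90, 95, 100])
# every percentile equals 0.5 -> x is not strictly increasing
try:
    InterpolatedUnivariateSpline(pctl, pctl, k=1)
except ValueError as err:
    print(err)  # e.g. "x must be strictly increasing"
```

This suggests the time series are not corrupted so much as (near-)constant, which makes CDF matching ill-defined regardless of SciPy version.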

ModuleNotFoundError from mkl_fft-1.0.4

When testing pytesmo in a freshly created conda env, I get issues with mkl_fft-1.0.4 throwing ModuleNotFoundError during the tests.

The env is created with:

conda create --yes -n deptest -c conda-forge python=2.7 numpy scipy pandas=0.22 netCDF4 cython pytest pip matplotlib pyproj

and that automatically installs mkl_fft-1.0.4 on the machines I'm testing on.

The unit tests then fail with:

Traceback (most recent call last):
File "setup.py", line 66, in <module>
    setup_package()
File "setup.py", line 62, in setup_package
    use_pyscaffold=True)
File "/data/jenkins/workspace/Pytesmo_installation/pytesmo_install/pytesmo_install/pytesmoenv/lib/python2.7/site-packages/setuptools/__init__.py", line 131, in setup
    return distutils.core.setup(**attrs)
File "/data/jenkins/workspace/Pytesmo_installation/pytesmo_install/pytesmo_install/pytesmoenv/lib/python2.7/distutils/core.py", line 151, in setup
    dist.run_commands()
File "/data/jenkins/workspace/Pytesmo_installation/pytesmo_install/pytesmo_install/pytesmoenv/lib/python2.7/distutils/dist.py", line 953, in run_commands
    self.run_command(cmd)
File "/data/jenkins/workspace/Pytesmo_installation/pytesmo_install/pytesmo_install/pytesmoenv/lib/python2.7/distutils/dist.py", line 972, in run_command
    cmd_obj.run()
File "/data/jenkins/workspace/Pytesmo_installation/pytesmo_install/code/.eggs/PyScaffold-2.5.11-py2.7.egg/pyscaffold/pytest_runner.py", line 101, in run
    self.with_project_on_sys_path(self.run_tests)
File "/data/jenkins/workspace/Pytesmo_installation/pytesmo_install/pytesmo_install/pytesmoenv/lib/python2.7/site-packages/setuptools/command/test.py", line 126, in with_project_on_sys_path
    with self.project_on_sys_path():
File "/data/jenkins/workspace/Pytesmo_installation/pytesmo_install/pytesmo_install/pytesmoenv/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
File "/data/jenkins/workspace/Pytesmo_installation/pytesmo_install/pytesmo_install/pytesmoenv/lib/python2.7/site-packages/setuptools/command/test.py", line 154, in project_on_sys_path
    self.run_command('build_ext')
File "/data/jenkins/workspace/Pytesmo_installation/pytesmo_install/pytesmo_install/pytesmoenv/lib/python2.7/distutils/cmd.py", line 326, in run_command
    self.distribution.run_command(command)
File "/data/jenkins/workspace/Pytesmo_installation/pytesmo_install/pytesmo_install/pytesmoenv/lib/python2.7/distutils/dist.py", line 972, in run_command
    cmd_obj.run()
File "/data/jenkins/workspace/Pytesmo_installation/pytesmo_install/pytesmo_install/pytesmoenv/lib/python2.7/distutils/command/build_ext.py", line 340, in run
    self.build_extensions()
File "setup.py", line 41, in build_extensions
    numpy_incl = pkg_resources.resource_filename('numpy', 'core/include')
File "/data/jenkins/workspace/Pytesmo_installation/pytesmo_install/pytesmo_install/pytesmoenv/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1135, in resource_filename
    return get_provider(package_or_requirement).get_resource_filename(
File "/data/jenkins/workspace/Pytesmo_installation/pytesmo_install/pytesmo_install/pytesmoenv/lib/python2.7/site-packages/pkg_resources/__init__.py", line 351, in get_provider
    __import__(moduleOrReq)
File "/data/jenkins/workspace/Pytesmo_installation/pytesmo_install/pytesmo_install/pytesmoenv/lib/python2.7/site-packages/numpy/__init__.py", line 158, in <module>
    from . import fft
File "/data/jenkins/workspace/Pytesmo_installation/pytesmo_install/pytesmo_install/pytesmoenv/lib/python2.7/site-packages/numpy/fft/__init__.py", line 15, in <module>
    import mkl_fft._numpy_fft as _nfft
File "/data/jenkins/workspace/Pytesmo_installation/pytesmo_install/pytesmo_install/pytesmoenv/lib/python2.7/site-packages/mkl_fft/__init__.py", line 27, in <module>
    from ._pydfti import (fft, ifft, fft2, ifft2, fftn, ifftn, rfft, irfft,
File "mkl_fft/_pydfti.pyx", line 32, in init mkl_fft._pydfti
NameError: name 'ModuleNotFoundError' is not defined

However, if I install mkl_fft with pip (instead of conda), I get version 1.0.2 which doesn't cause this problem. I can also uninstall mkl_fft and run the tests without problems.

Yeah, I know that this problem doesn't seem to occur on Travis - I guess because mkl_fft isn't installed there (probably because the CPUs there don't identify as Intel?).

So there is a workaround, but it looks like compiling and testing pytesmo no longer works out of the box on Python 2.7?

Scaling with NaN

Hi! I'm using the linreg scaling for bias correction. Currently I get an error when the candidate and reference time series contain NaNs at different points in time, because the regression cannot be calculated if the two series don't match. But why can't we fit the model on the coinciding values (dropna before linreg) and then apply the correction to ALL values of the candidate (NaNs included)? The non-NaN candidate values would be scaled and no values would be dropped. I think this would be a better solution, but I guess there's also a reason why you implemented it differently?
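A sketch of the proposed behaviour (hypothetical helper, not the current pytesmo API): fit on coinciding non-NaN pairs, apply to everything:

```python
import numpy as np
import pandas as pd

def linreg_scale_keep_nan(candidate, reference):
    """Fit a linear model on coinciding non-NaN pairs only,
    then apply it to ALL candidate values (NaNs stay NaN)."""
    both = pd.concat([candidate, reference], axis=1,
                     keys=["cand", "ref"]).dropna()
    slope, intercept = np.polyfit(both["cand"], both["ref"], deg=1)
    return candidate * slope + intercept
```

Values that are NaN in the candidate simply stay NaN after scaling, while every valid candidate value gets corrected.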

RuntimeError

The data manager of the validation framework catches only IOError, but e.g. netCDF4 raises a RuntimeError if a file doesn't exist, see here.

The solution would be to also catch RuntimeError, e.g.:

    except (RuntimeError, IOError) as err:
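A minimal sketch of the broadened handling (the reader function and names are hypothetical):

```python
import warnings

def safe_read(read_func, *args, **kwargs):
    """Return the reader's result, or None if the read fails.
    netCDF4 raises RuntimeError rather than IOError for missing files,
    so both exception types are caught."""
    try:
        return read_func(*args, **kwargs)
    except (RuntimeError, IOError) as err:
        warnings.warn("could not read data: {}".format(err))
        return None
```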

data_manager drops na values

The data_manager drops NaN values during reading.
This can be dangerous if the dataframe contains several columns of which only one has a NaN value. So we should either change this to dropna(how='all') or remove it completely and let the data reader or the data preparation step take care of it.

@aplocon Is there any other reason you can think of why it would make sense to drop the nan values in https://github.com/TUW-GEO/pytesmo/blob/develop/pytesmo/validation_framework/data_manager.py#L159 or https://github.com/TUW-GEO/pytesmo/blob/develop/pytesmo/validation_framework/data_manager.py#L209
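The difference is easy to demonstrate on a toy dataframe:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"sm": [0.1, np.nan, 0.3],
                   "flag": [0.0, 1.0, np.nan]})

rows_any = len(df.dropna())            # drops a row if ANY column is NaN
rows_all = len(df.dropna(how="all"))   # drops a row only if ALL columns are NaN
```

With the current behaviour only one of the three rows survives, even though every row carries valid data in at least one column; dropna(how='all') keeps all three.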

numpy issue with timedate.doy

After an update of numpy, timedate.doy didn't work any more because of the addition of a boolean array to an integer array (Line 306)... I could resolve it with a simple type conversion:

    day_of_year = (day_of_year - nonleap_years.astype('int')
                   + np.logical_and(day_of_year < 60, nonleap_years).astype('int'))

Not sure if this was really a numpy issue or something else, but this made it work again.
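A toy illustration of the corrected expression (made-up inputs; days at or after the missing Feb 29 of a non-leap year shift back by one, earlier days stay unchanged):

```python
import numpy as np

doy = np.array([59, 60, 61])             # day of year on a leap-year calendar
nonleap = np.array([True, True, True])   # observations falling in non-leap years

# Casting the boolean masks to int before the arithmetic avoids the
# boolean add/subtract restrictions of newer NumPy versions:
fixed = (doy - nonleap.astype('int')
         + np.logical_and(doy < 60, nonleap).astype('int'))
```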

ISMN reader produces different indices for same stations on different systems

Like it says in the title: The ISMN reader (in particular get_dataset_ids) produces different indices for the same stations on different systems. This makes it harder to check errors occurring in a production system on your developer machine.

I think this comes from the metadata collector (pytesmo.io.ismn.metadata_collector.collect_from_folder), line 57, which uses os.walk. os.walk doesn't guarantee an order, but we could sort the folders and files lists alphabetically. This would take care of the problem, barring locale issues (different sort orders under different locales).
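A sketch of the sorted traversal (the helper name is made up; sorting the dirs list in place also fixes the order in which os.walk recurses):

```python
import os
import tempfile

def walk_sorted(path):
    """os.walk with deterministic alphabetical ordering."""
    for root, dirs, files in os.walk(path):
        dirs.sort()    # in-place sort controls the recursion order too
        files.sort()
        yield root, dirs, files

# Demo on a throwaway directory tree:
base = tempfile.mkdtemp()
for name in ("b_station", "a_station"):
    os.mkdir(os.path.join(base, name))
order = next(walk_sorted(base))[1]
```

Regardless of the order the OS returns, the stations are now always visited alphabetically, so the generated dataset ids stay stable across systems.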

Refactor and improve scaling module

The current implementations of the scaling functions need some attention and more comments.

From an email by y.zeng@...

As for the usage of scaling module, I found some part of the code not easy to digest. Could you help me with some explanation?

For the “scaling.lin_cdf_match”, I found lines as below:
    percentiles = [0, 5, 10, 30, 50, 70, 90, 95, 100]

    in_data_pctl = np.array(np.percentile(in_data, percentiles))
    scale_to_pctl = np.array(np.percentile(scale_to, percentiles))

    uniq_ind = np.unique(in_data_pctl, return_index=True)[1]
    in_data_pctl = in_data_pctl[uniq_ind]
    scale_to_pctl = scale_to_pctl[uniq_ind]

    uniq_ind = np.unique(scale_to_pctl, return_index=True)[1]
    in_data_pctl = in_data_pctl[uniq_ind]
    scale_to_pctl = scale_to_pctl[uniq_ind]

    f = sc_int.interp1d(in_data_pctl, scale_to_pctl)

    ff = f(in_data)

I am wondering what the function of the lines in the middle (the two np.unique blocks) is. It seems to me that directly using "in_data_pctl" and "scale_to_pctl" in "sc_int.interp1d" already gives reasonable results.

Could you explain the intention of this middle part of the code?

We should also implement the separate estimation and application of CDF scaling parameters.
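A sketch of what the split could look like (hypothetical function names, using scipy's interp1d as in the current code; the np.unique step drops duplicate percentiles so the interpolator's x stays strictly increasing, which answers the question above):

```python
import numpy as np
from scipy import interpolate

def cdf_match_params(in_data, scale_to,
                     percentiles=(0, 5, 10, 30, 50, 70, 90, 95, 100)):
    """Estimate matched percentile pairs; duplicate source percentiles
    (which occur for near-constant data) are removed."""
    src = np.percentile(in_data, percentiles)
    dst = np.percentile(scale_to, percentiles)
    src, idx = np.unique(src, return_index=True)
    return src, dst[idx]

def cdf_match_apply(data, src_pctl, dst_pctl):
    """Apply previously estimated scaling parameters to (new) data."""
    f = interpolate.interp1d(src_pctl, dst_pctl, fill_value="extrapolate")
    return f(data)
```

Separating the two steps would let users estimate the parameters once (e.g. on a calibration period) and apply them to other periods or data sets.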
