
xyzpy's Introduction

xyzpy logo



xyzpy is a Python library for efficiently generating, manipulating and plotting data with many dimensions, of the type that often occurs in numerical simulations. It stands wholly atop the labelled N-dimensional array library xarray. The project's documentation is hosted on readthedocs.

The aim is to take the pain and errors out of generating and exploring data with a high number of possible parameters. This means:

  • you don't have to write super nested for loops
  • you don't have to remember which arrays/dimensions belong to which variables/parameters
  • you don't have to parallelize over or distribute runs yourself
  • you don't have to worry about loading, saving and merging disjoint data
  • you don't have to guess when a set of runs is going to finish
  • you don't have to write batch submission scripts or leave the notebook to use SGE, PBS or SLURM

As well as the ability to automatically parallelize over runs, xyzpy provides the Crop object, which allows runs and results to be written to disk. These can then be executed by any process with access to the files - e.g. a batch system such as SGE, PBS or SLURM - or simply serve as a convenient persistent progress mechanism.
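
As an illustration, here is a minimal sketch of the core pattern (the toy function and parameter values are hypothetical; Runner and run_combos are from the documented API):

import numpy as np
import xyzpy

def simulate(amplitude, frequency):
    # hypothetical toy 'simulation' returning a single scalar
    t = np.linspace(0, 1, 100)
    return np.trapz(amplitude * np.sin(frequency * t), t)

runner = xyzpy.Runner(simulate, var_names=['signal'])

combos = {
    'amplitude': [0.5, 1.0, 2.0],
    'frequency': np.linspace(1, 10, 5),
}

# run every combination of parameters (optionally in parallel) and
# collect the results into a fully labelled xarray.Dataset
ds = runner.run_combos(combos)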

Once your data has been aggregated into an xarray.Dataset or pandas.DataFrame, there exist many powerful visualization tools such as seaborn, altair and holoviews / hvplot. To these, xyzpy also adds a simple 'one-liner' interface for interactively plotting the data using bokeh, or for producing static, publication-ready figures using matplotlib, whilst being able to show the dependence on up to 4 dimensions at once.

[example figure]

Please see the docs for more information.

xyzpy's People

Contributors

adamcallison, jcmgray, toddrme2178


xyzpy's Issues

Merging case_runner crop fails

Hi @jcmgray,

I noticed that crop.reap() fails when there are several crops created by a Harvester that uses a case_runner instance. The reason seems to be that case_runners return pandas DataFrames instead of xarray Datasets (unlike combo_runners), and hence the merge function is called on the resulting DataFrame object, which fails accordingly:

TypeError: merge() got an unexpected keyword argument 'compat'

My current workaround is to merge all cases manually so that I can avoid creating multiple crops in the first place.
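
For reference, the mismatch is easy to see outside xyzpy: xarray's top-level merge accepts a compat keyword, but pandas.DataFrame.merge does not. A minimal illustration (the dataset contents are arbitrary):

import pandas as pd
import xarray as xr

ds1 = xr.Dataset({'a': ('x', [1, 2])})
ds2 = xr.Dataset({'b': ('x', [3, 4])})

# xarray's merge accepts a `compat` keyword...
xr.merge([ds1, ds2], compat='no_conflicts')

# ...but pandas.DataFrame.merge does not, hence the TypeError above
df = pd.DataFrame({'a': [1, 2]})
df.merge(df, compat='no_conflicts')  # raises TypeError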

Using holoviews

This package looks great!

You seem to have implemented an interface on top of bokeh and matplotlib, are you aware of the existence of holoviews?

I think that using HoloViews will greatly reduce the amount of code here :)

Accessing intermediate results

Hi jcmgray,

this package is really fantastic, it solves exactly the problems that I've been struggling with for years! Thanks for your work!

I've just started using the package, though, and I have a question concerning batch processing: is there any straightforward way to access intermediate results of the computation by storing them on the disk? I've thought about two ways in particular:

  1. Accessing the on-disk dataset created by the harvester. However, by default, the dataset is created only after all combos are evaluated. Is there some workaround / flag to set?
  2. Using the crop functionality. However, I cannot reap the results during the computation since it gives the following error: This crop is not ready to reap yet - results are missing

Any thoughts?
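
For reference, a minimal sketch of what option 2 might look like with the allow_incomplete flag that Crop.reap accepts (see also the RecursionError issue further down this page); the setup names here are placeholders:

import xyzpy

# hypothetical setup mirroring the question
h = xyzpy.Harvester(my_labelled_func, 'results.h5')
crop = h.Crop('my_crop')
crop.sow_combos(combos)

# ... while batches are still being grown elsewhere ...

# reap whatever has finished so far; unfinished combos become NaN
ds_partial = crop.reap(allow_incomplete=True)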

Can't pickle ...

Hi @jcmgray, I'm currently using your awesome package to automate my experiments and noticed a problem related to pickling certain data types. While the cloudpickle backend of joblib should handle, for example, lambda functions fine, I get an error when working with certain modules based on torch.

Here is a minimal example:

import xyzpy
import botorch
import torch

@xyzpy.label(['model'])
def fun(a):
    x = torch.tensor([[0.]])
    y = torch.tensor([[0.]])
    return botorch.models.SingleTaskGP(x, y)

combos = dict(
    a=range(10)
)

h = xyzpy.Harvester(fun, 'result')
c = h.Crop('test')
c.sow_combos(combos)
c.grow_missing()
c.reap()

It produces the following error:

_pickle.PicklingError: Can't pickle <function _HomoskedasticNoiseBase.__init__.<locals>.<lambda> at 0x14cd89e18>: it's not found as gpytorch.likelihoods.noise_models._HomoskedasticNoiseBase.__init__.<locals>.<lambda>

Tested with Python 3.7.3 and

botorch==0.2.1
torch==1.6.0
xyzpy==1.0.0

After a short search, I found this related post: cornellius-gp/gpytorch#907
A potential solution seems to be using dill instead of pickle. Do you think this option can be added to xyzpy?
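
For reference, a minimal illustration (plain Python, not an xyzpy API) of why dill helps here: it can serialize locally defined lambdas that the standard pickle module rejects:

import pickle
import dill

def make_closure():
    # a local lambda, analogous to the one inside gpytorch's noise model
    return lambda x: x + 1

f = make_closure()

try:
    pickle.dumps(f)
except Exception as e:
    print('pickle failed:', e)  # "Can't pickle <function ... <lambda>>"

# dill serializes the lambda by value instead of by reference
g = dill.loads(dill.dumps(f))
assert g(1) == 2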

For now, my workaround is to remove all problematic attributes from the object returned by the function being evaluated, after all internal computations have been completed. However, it would of course be much nicer if such objects could be handled naturally by xyzpy.

Kind regards,
Adrian

Failing tests related to netcdf

Hi,

I recently reinstalled xyzpy (from my up-to-date fork of the develop branch) into a fresh anaconda environment and ran the test suite to make sure I had things set up properly. After installing any missing libraries that had caused tests to fail, I ran the tests again and saw that two tests were still failing with the following error messages:

FAILED tests/test_manage.py::TestSaveAndLoad::test_io_complex_data[h5netcdf-h5netcdf] - h5netcdf.core.CompatibilityError: complex dtypes are not a supported NetCDF feature, and are not allowed by h5net...
FAILED tests/test_manage.py::TestSaveAndLoad::test_save_merge_ds - h5netcdf.core.CompatibilityError: complex dtypes are not a supported NetCDF feature, and are not allowed by h5net...

The actual tests in question look as follows:

    @mark.parametrize(("engine_save, engine_load"),
                      [('h5netcdf', 'h5netcdf'),
                       ('zarr', 'zarr'),
                       ('joblib', 'joblib'),
                       param('h5netcdf', 'netcdf4', marks=mark.xfail),
                       param('netcdf4', 'h5netcdf', marks=mark.xfail),
                       param('netcdf4', 'netcdf4', marks=mark.xfail)])
    def test_io_complex_data(self, ds1, engine_save, engine_load):
        with tempfile.TemporaryDirectory() as tmpdir:
            save_ds(ds1, os.path.join(tmpdir, "test.h5"), engine=engine_save)
            ds2 = load_ds(os.path.join(tmpdir, "test.h5"), engine=engine_load)
            assert ds1.identical(ds2)

    def test_save_merge_ds(self, ds1, ds2, ds3):
        with tempfile.TemporaryDirectory() as tmpdir:
            fname = os.path.join(tmpdir, "test.h5")
            save_merge_ds(ds1, fname)
            save_merge_ds(ds2, fname)
            with raises(xr.MergeError):
                save_merge_ds(ds3, fname)
            save_merge_ds(ds3, fname, overwrite=True)
            exp = ds3.combine_first(xr.merge([ds1, ds2]))
            assert load_ds(fname).identical(exp)

Is there perhaps a specific version requirement for netcdf that is not specified in the docs?
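
For context, complex dtypes are not part of the NetCDF standard, and h5netcdf only writes them when explicitly told to. A minimal illustration outside xyzpy, using xarray's documented invalid_netcdf flag:

import numpy as np
import xarray as xr

ds = xr.Dataset({'z': ('x', np.array([1 + 2j, 3 + 4j]))})

# fails with the same CompatibilityError as the tests above:
# ds.to_netcdf('test.h5', engine='h5netcdf')

# works: opt in to non-standard ("invalid") NetCDF features
ds.to_netcdf('test.h5', engine='h5netcdf', invalid_netcdf=True)

So the failures may come down to how save_ds sets (or doesn't set) that flag, rather than a specific netcdf version requirement.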

RecursionError when using `allow_incomplete` option on Crop.reap

When I use the allow_incomplete option it sometimes fails with this error (output shown below). I have another dataset of equal size whose runs all completed, so I did not need the allow_incomplete option there, and it reaped fine. I tried increasing the recursion limit with sys.setrecursionlimit() in the Jupyter notebook I was using, and it still failed at 100,000, which is when I thought I should not push my luck any more. I am using the latest GitHub version of xyzpy.

---------------------------------------------------------------------------
RecursionError                            Traceback (most recent call last)
<ipython-input-6-d4907a788b7f> in <module>
----> 1 crop.reap(allow_incomplete=True,) #allow_incomplete=True,overwrite=True,

~/anaconda3/envs/qcoptim-qiskit-up-to-date/lib/python3.8/site-packages/xyzpy/gen/batch.py in reap(self, wait, sync, overwrite, clean_up, allow_incomplete)
    745         if isinstance(self.farmer, Harvester):
    746             opts['overwrite'] = overwrite
--> 747             return self.reap_harvest(self.farmer, **opts)
    748 
    749         if isinstance(self.farmer, Sampler):

~/anaconda3/envs/qcoptim-qiskit-up-to-date/lib/python3.8/site-packages/xyzpy/gen/batch.py in reap_harvest(self, harvester, wait, sync, overwrite, clean_up, allow_incomplete)
    685             raise ValueError("Cannot reap and harvest if no Harvester is set.")
    686 
--> 687         ds = self.reap_runner(harvester.runner, wait=wait, clean_up=clean_up,
    688                               allow_incomplete=allow_incomplete)
    689 

~/anaconda3/envs/qcoptim-qiskit-up-to-date/lib/python3.8/site-packages/xyzpy/gen/batch.py in reap_runner(self, runner, wait, clean_up, allow_incomplete)
    664         # Can ignore `Runner.resources` as they play no part in desecribing the
    665         #   output, though they should be supplied to sow and thus grow.
--> 666         ds = self.reap_combos_to_ds(
    667             var_names=runner._var_names,
    668             var_dims=runner._var_dims,

~/anaconda3/envs/qcoptim-qiskit-up-to-date/lib/python3.8/site-packages/xyzpy/gen/batch.py in reap_combos_to_ds(self, var_names, var_dims, var_coords, constants, attrs, parse, wait, clean_up, allow_incomplete)
    616         check_ready_to_reap(self, allow_incomplete, wait)
    617 
--> 618         clean_up, default_result = calc_clean_up_default_res(
    619             self, clean_up, allow_incomplete
    620         )

~/anaconda3/envs/qcoptim-qiskit-up-to-date/lib/python3.8/site-packages/xyzpy/gen/batch.py in calc_clean_up_default_res(crop, clean_up, allow_incomplete)
    137 
    138     if allow_incomplete:
--> 139         default_result = crop.all_nan_result
    140     else:
    141         default_result = None

~/anaconda3/envs/qcoptim-qiskit-up-to-date/lib/python3.8/site-packages/xyzpy/gen/batch.py in all_nan_result(self)
    422                                "one finished result.")
    423             reference_result = joblib.load(result_files[0])[0]
--> 424             self._all_nan_result = nan_like_result(reference_result)
    425 
    426         return self._all_nan_result

~/anaconda3/envs/qcoptim-qiskit-up-to-date/lib/python3.8/site-packages/xyzpy/gen/batch.py in nan_like_result(res)
    124 
    125     try:
--> 126         return tuple(np.broadcast_to(np.nan, infer_shape(x)) for x in res)
    127     except TypeError:
    128         return np.nan

~/anaconda3/envs/qcoptim-qiskit-up-to-date/lib/python3.8/site-packages/xyzpy/gen/batch.py in <genexpr>(.0)
    124 
    125     try:
--> 126         return tuple(np.broadcast_to(np.nan, infer_shape(x)) for x in res)
    127     except TypeError:
    128         return np.nan

~/anaconda3/envs/qcoptim-qiskit-up-to-date/lib/python3.8/site-packages/xyzpy/gen/batch.py in infer_shape(x)
    102     try:
    103         shape += (len(x),)
--> 104         return shape + infer_shape(x[0])
    105     except TypeError:
    106         return shape

... last 1 frames repeated, from the frame below ...

~/anaconda3/envs/qcoptim-qiskit-up-to-date/lib/python3.8/site-packages/xyzpy/gen/batch.py in infer_shape(x)
    102     try:
    103         shape += (len(x),)
--> 104         return shape + infer_shape(x[0])
    105     except TypeError:
    106         return shape

RecursionError: maximum recursion depth exceeded while calling a Python object
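
For what it's worth, here is a plausible standalone reproduction of the loop: for a single-character string, x[0] is the string itself, so the shape inference shown in the traceback never bottoms out:

def infer_shape(x):
    # mirrors the recursion in xyzpy/gen/batch.py from the traceback above
    shape = ()
    try:
        shape += (len(x),)
        return shape + infer_shape(x[0])
    except TypeError:
        return shape

# 'a'[0] == 'a', so len() keeps succeeding and the recursion never
# terminates, raising RecursionError just as above
infer_shape('a')

So a string-valued (or similarly self-indexing) entry in the reference result would trigger exactly this error.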

`var_names` not needed when DataArrays are passed

@dcherian pointed me towards this library in pydata/xarray#7498, it looks awesome!

Very small point — when I have something like the linked example, but add a cast to a DataArray in the func:

import numpy as np
import xarray as xr

def generate_timeseries(x, y):
    return xr.DataArray(np.random.normal(loc=x, scale=y, size=100))

...then var_names doesn't seem to do anything; I get back a DataArray anyway (which is great!). But it still requires passing something, so I'm just passing Runner(gen, var_names=["foo"]) atm.

Could we avoid having to pass anything there? Or is there a different construction I should be choosing?

cross-pollination from/to `adaptive`

I am impressed with this package!

In my field we very often do these loops over multiple dimensions and generate many curves for different dimensions.

We (my colleagues and I) tried to tackle a very similar problem to the one xyzpy is trying to solve. We wrote adaptive, which does things similar to xyzpy; the biggest difference is that it can adaptively sample one (or two) of the dimensions.

As an example I adapted your Basic Output Example to do the same but with adaptive:

import adaptive
import holoviews as hv
from functools import partial
from itertools import product
from scipy.special import eval_jacobi
import numpy as np
adaptive.notebook_extension()

def jacobi(x, n, alpha, beta):
    return eval_jacobi(n, alpha, beta, x)

combos = {
    'n': [1, 2, 4, 8, 16],
    'alpha': np.linspace(0, 2, 3),
    'beta': np.linspace(0, 1, 5),
}

def named_product(**items):
    names = items.keys()
    vals = items.values()
    return [dict(zip(names, res)) for res in product(*vals)]

learners = {}
for combo in named_product(**combos):
    learners[tuple(combo.values())] = adaptive.Learner1D(partial(jacobi, **combo), bounds=[0, 1])
    
balancing_learner = adaptive.BalancingLearner(list(learners.values()))

which creates "learners", which are essentially objects from which you can request new points and to which you can tell (feed back) evaluated points.

then you "learn" the function by creating a Runner (this doesn't block the kernel and runs on all the cores, optionally you provide a excecutor to run it on a cluster)

runner = adaptive.Runner(balancing_learner, goal=lambda learner: learner.loss() < 0.01)
runner.live_info()

[screenshot: runner.live_info() progress widget]

Then plot the data with:

balancing_learner.plot(cdims=named_product(**combos)).overlay('beta').grid()

[screenshot: resulting grid of plots]

As you can see, it is not nearly as short as your code, nor do we provide the functionality to save the data. Also, our interface is not really optimized for easily generating the combos, but this is where we can learn from xyzpy. On the other hand, I think there is probably something useful for you in adaptive too.

(P.S. this is not really an "issue", but more of a place to hopefully exchange some ideas)

EDIT
Inspired by your work, I've created this PR, after which one can just do:

learner = adaptive.BalancingLearner.from_combos(
    jacobi, adaptive.Learner1D, dict(bounds=(0, 1)), combos)
runner = adaptive.BlockingRunner(learner, goal=lambda l: l.loss() < 0.01)
learner.plot(cdims=adaptive.utils.named_product(**combos)).overlay('beta').grid()

conda package missing?

The documentation states that there should be a conda package.
I cannot find it, though... Am I missing something?

petrucci ➜  ~ conda search xyzpy
Loading channels: done

PackagesNotFoundError: The following packages are not available from current channels:

  - xyzpy

Current channels:

  - https://conda.anaconda.org/conda-forge/linux-64
  - https://conda.anaconda.org/conda-forge/noarch
  - https://repo.anaconda.com/pkgs/main/linux-64
  - https://repo.anaconda.com/pkgs/main/noarch
  - https://repo.anaconda.com/pkgs/free/linux-64
  - https://repo.anaconda.com/pkgs/free/noarch
  - https://repo.anaconda.com/pkgs/r/linux-64
  - https://repo.anaconda.com/pkgs/r/noarch
  - https://repo.anaconda.com/pkgs/pro/linux-64
  - https://repo.anaconda.com/pkgs/pro/noarch

To search for alternate channels that may provide the conda package you're
looking for, navigate to

    https://anaconda.org

and use the search bar at the top of the page.

Need numba in install_requires

Installing xyzpy with

pip install xyzpy

does not install the required numba package since it is not included in the install_requires argument in setup.py.
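
A minimal sketch of the suggested fix (excerpt only; which other dependencies to list is of course xyzpy's call):

# setup.py (excerpt)
from setuptools import setup

setup(
    name='xyzpy',
    install_requires=[
        'numpy',
        'xarray',
        'numba',  # currently missing, so `pip install xyzpy` doesn't pull it in
    ],
)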

ps. Thank you for working on this project, it looks very useful!
