
pmda's Issues

rms.RMSD should be able to superimpose

Expected behaviour

The MDA class superimposes by default. The parallel pmda.rms.RMSD does not superimpose. I would expect the two to behave broadly similarly, i.e., the default should be to superimpose.

Actual behaviour

Only a raw RMSD calculation is performed; there is no option to switch on superposition.

Code to reproduce the behaviour

See #24

Stacking results fails with odd number of frames

I've got a trajectory (7367 frames) that was split across four CPUs, resulting in partial trajectories with 3x 1842 and 1x 1841 frames. Running np.hstack(self._results) does not work; it should be either np.vstack or np.concatenate.

pmda/pmda/custom.py

Lines 102 to 103 in 723d021

def _conclude(self):
    self.results = np.hstack(self._results)

Expected behaviour

Results are concatenated over the frame axis.

Actual behaviour

ValueError is raised.

Code to reproduce the behaviour

import numpy as np
data = np.zeros([7367, 10])
print(data.shape)
# (7367, 10)

data_frames = np.array_split(data, 4)
print([frames.shape[0] for frames in data_frames])
# [1842, 1842, 1842, 1841]

np.hstack(data_frames)
# [...]
# ValueError: all the input array dimensions except for the concatenation axis must match exactly

np.vstack(data_frames)
# array([[0., 0., 0., ..., 0., 0., 0.],
#       [0., 0., 0., ..., 0., 0., 0.],
#       [0., 0., 0., ..., 0., 0., 0.],
#       ...,
#       [0., 0., 0., ..., 0., 0., 0.],
#       [0., 0., 0., ..., 0., 0., 0.],
#       [0., 0., 0., ..., 0., 0., 0.]])

Current version of MDAnalysis:

(run python -c "import MDAnalysis as mda; print(mda.__version__)")
0.18.0

(run python -c "import pmda; print(pmda.__version__)")
0.1.1

(run python -c "import dask; print(dask.__version__)")
0.19.1

updated docs with explicit scheduler setting example

Expected behaviour

PMDA should run with "best performance" out of the box.

At a minimum, the docs should be clear what one has to do.

The dask docs recommend distributed for GIL-bound code, so in the following I use it as the default, but we should benchmark the single-machine schedulers:

  • threads
  • processes
  • distributed

For using distributed:

from dask.distributed import Client
client = Client()

import pmda
...

or configure the scheduler:

import dask
dask.config.set(scheduler='distributed')

Actual behaviour

With PMDA now using dask's preferred way to select a scheduler (#66), we default to dask's default scheduler. For delayed(), this is the threads scheduler, which does not work well with our Python-based code: the GIL serializes the tasks, and I expect that out-of-the-box performance is poor.

Hydrogen bonds analysis using PMDA

Following a discussion on the MDAnalysis forum: would it be possible to build a function that performs hydrogen bond analysis with the MDAnalysis.analysis.hbonds module in parallel using PMDA, to reduce the running time for long MD trajectories?

I would be happy to be part of it if you have any advice on where to start.
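
As a possible starting point (only a sketch under assumptions, not an existing pmda feature): one could approximate a per-frame hydrogen-bond count with a purely geometric donor–acceptor distance criterion and let pmda.custom.AnalysisFromFunction parallelize it over frames. The file names and selections below are placeholders, and a real analysis would also need the angle criterion that MDAnalysis.analysis.hbonds applies.

    import MDAnalysis as mda
    from MDAnalysis.lib.distances import capped_distance
    import pmda.custom

    u = mda.Universe("topol.tpr", "traj.xtc")          # placeholder file names
    donors = u.select_atoms("protein and name N")      # crude donor heavy atoms
    acceptors = u.select_atoms("protein and name O")   # crude acceptor atoms

    def count_contacts(donors, acceptors, cutoff=3.5):
        # geometric distance criterion only; no angle check here
        pairs = capped_distance(donors.positions, acceptors.positions,
                                max_cutoff=cutoff, box=donors.dimensions,
                                return_distances=False)
        return len(pairs)

    contacts = pmda.custom.AnalysisFromFunction(count_contacts, u,
                                                donors, acceptors)
    contacts.run(n_jobs=4)
    print(contacts.results)   # one contact count per frame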

release 0.3.0

We should create a 0.3.0 release (REU contributions and other fixes/enhancements).

  • tagged
  • GitHub release
  • PyPi release
  • conda-forge release

update with SciPy paper citation

The SciPy paper is published https://conference.scipy.org/proceedings/scipy2019/shujie_fan.html

  1. Shujie Fan, Max Linke, Ioannis Paraskevakos, Richard J. Gowers, Michael Gecht, and Oliver Beckstein. PMDA - Parallel Molecular Dynamics Analysis. In Chris Calloway, David Lippa, Dillon Niederhut, and David Shupe, editors, Proceedings of the 18th Python in Science Conference, pages 134 – 142, 2019. doi: 10.25080/Majora-7ddc1dd1-013

Update all docs with the proper paper reference.

  • README
  • sphinx docs
  • comment header in files (?)

cc: @MDAnalysis/pmda

Use future API to support parallel AlignTraj

This is a proof-of-concept script showing how to use the futures API of dask to support writing results into a trajectory while still working on frames in parallel. It won't scale out a lot. It also assumes that the single_frame function returns a namedtuple with coordinates. The buffer can potentially blow up, so use with care.

from dask.distributed import as_completed

def dask_execution(single_frame, trajectory, writer, client):
    # have the writer in the main process
    buf = {}
    idx = 0
    with writer as w:
        futures = client.map(single_frame, trajectory)
        for f in as_completed(futures):
            res = f.result()
            buf[res.frame] = res.xyz
            # write everything we can *in order*
            while idx in buf:
                w.write(buf[idx])
                del buf[idx]
                idx += 1
        # write leftover frames
        for i in sorted(buf):
            w.write(buf[i])

stop when using memory reader

The memory reader currently does not work because we recreate a universe for each process. Making it work would require special handling. As a first step, though, we should warn the user that the memory reader is not supported.
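
A minimal sketch of such a guard (hypothetical helper, not existing pmda code):

    from MDAnalysis.coordinates.memory import MemoryReader

    def _check_not_in_memory(universe):
        # each worker rebuilds the universe from files, so an in-memory
        # trajectory in the parent process is invisible to the workers
        if isinstance(universe.trajectory, MemoryReader):
            raise NotImplementedError(
                "pmda does not support in-memory trajectories (MemoryReader); "
                "write the trajectory to disk first.")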

create a 0.1.0 release

The library is useful so we should release a 0.1.0.

  • GitHub
  • pypi
  • conda

@kain88-de do you have time to set up the conda infrastructure?

I'll register with zenodo so that we can also have a DOI, at least for right now.

Parallel implementation not leveraging the lightweight serializations/de-serializations of MDA

Right now parallel analyses do their own sort of pickling/unpickling. While this gives the analysis more control and possibly skips unneeded pickle/unpickle cruft, it has a number of drawbacks:

  • Subclasses must be explicitly written to make an analysis parallel. This makes automated parallelization of existing analyses hard;
  • Each parallel analysis must implement its own pickling/unpickling strategy, leading to code duplication among analyses and with the existing pickling/unpickling machinery;
  • Complexities of unpickling that are already solved by the MDA machinery have to be dealt with again (#13, #18).

I vote we rely on the Universe's and AtomGroup's pickling/unpickling behavior. If we could do that, then making a parallel analysis would become an automated wrapper, and users would not need to subclass ParallelAnalysisBase themselves.

I understand that, as it is right now, it can't work the way I want, because we must explicitly tell the analysis to pickle the Universe along with the AtomGroup (otherwise AtomGroup unpickling will fail). We never got around to implementing auto-loading of a Universe when unpickling an AtomGroup (MDAnalysis/mdanalysis#878). At the time it seemed selfishly premature of me, since that would only address my own parallelization needs. Now that this becomes more widespread it might be good to revisit the idea. What do you think?

check scheduler names

We set the compute() scheduler near

pmda/pmda/parallel.py

Lines 332 to 337 in 34d465b

if scheduler is None:
    scheduler = 'multiprocessing'
if n_blocks is None:
    if scheduler == 'multiprocessing':
        n_blocks = n_jobs
as one of:

  • "multiprocessing"
  • assume single threaded if n_jobs=1
  • Client (from distributed)

However, in Dask 2.1+ (https://docs.dask.org/en/latest/scheduling.html) we should use

  • "processes"
  • "synchronous"

We need to check that setting the scheduler in PMDA to multiprocessing or single-threaded still works.

Check the versions of dask and their changelog: did they change the names at some point? If so, require a minimum version of dask...

This is also important for testing of schedulers in

pmda/conftest.py

Lines 28 to 37 in 34d465b

@pytest.fixture(scope='session', params=('distributed',
                                         'multiprocessing',
                                         'single-threaded'))
def scheduler(request, client):
    if request.param == 'distributed':
        arg = client
    else:
        arg = request.param
    with dask.config.set(scheduler=arg):
        yield

distributed.client error with atomgroup or universe type self attribute

Distributed

An error is raised when an AtomGroup or Universe is stored as a self attribute while using the distributed.Client dask scheduler.

One example is rdf.py (PR #70), which had the line

self.ags = ags

and it raised the following error:

______________________ test_same_result[distributed-2-1] _______________________

u = <Universe with 43480 atoms>
sels = [[<AtomGroup with 1 atom>, <AtomGroup with 2 atoms>], [<AtomGroup with 2 atoms>, <AtomGroup with 2 atoms>]]
n_blocks = 1, scheduler = None

    @pytest.mark.parametrize("n_blocks", [1, 2, 3, 4])
    def test_same_result(u, sels, n_blocks, scheduler):
        # should see same results from analysis.rdf.InterRDF_s
        # and pmda.rdf.InterRDF_s
        nrdf = rdf.InterRDF_s(u, sels).run()
>       prdf = InterRDF_s(u, sels).run(n_blocks=n_blocks)

/home/shujie/pmda/pmda/test/test_rdf_s.py:109: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/home/shujie/pmda/pmda/parallel.py:359: in run
    res = blocks.compute(**scheduler_kwargs)
/home/shujie/anaconda3/envs/py36/lib/python3.6/site-packages/dask/base.py:156: in compute
    (result,) = compute(self, traverse=False, **kwargs)
/home/shujie/anaconda3/envs/py36/lib/python3.6/site-packages/dask/base.py:395: in compute
    results = schedule(dsk, keys, **kwargs)
/home/shujie/anaconda3/envs/py36/lib/python3.6/site-packages/distributed/client.py:2230: in get
    direct=direct)
/home/shujie/anaconda3/envs/py36/lib/python3.6/site-packages/distributed/client.py:1593: in gather
    asynchronous=asynchronous)
/home/shujie/anaconda3/envs/py36/lib/python3.6/site-packages/distributed/client.py:647: in sync
    return sync(self.loop, func, *args, **kwargs)
/home/shujie/anaconda3/envs/py36/lib/python3.6/site-packages/distributed/utils.py:277: in sync
    six.reraise(*error[0])
/home/shujie/anaconda3/envs/py36/lib/python3.6/site-packages/six.py:693: in reraise
    raise value
/home/shujie/anaconda3/envs/py36/lib/python3.6/site-packages/distributed/utils.py:262: in f
    result[0] = yield future
/home/shujie/anaconda3/envs/py36/lib/python3.6/site-packages/tornado/gen.py:1099: in run
    value = future.result()
/home/shujie/anaconda3/envs/py36/lib/python3.6/site-packages/tornado/gen.py:1107: in run
    yielded = self.gen.throw(*exc_info)
/home/shujie/anaconda3/envs/py36/lib/python3.6/site-packages/distributed/client.py:1469: in _gather
    traceback)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

tp = <class 'distributed.scheduler.KilledWorker'>, value = None, tb = None

    def reraise(tp, value, tb=None):
        try:
            if value is None:
                value = tp()
            if value.__traceback__ is not tb:
                raise value.with_traceback(tb)
>           raise value
E           distributed.scheduler.KilledWorker: ('_dask_helper-0b781621-ac0a-4bda-b691-1507e7587020', 'tcp://127.0.0.1:36055')

/home/shujie/anaconda3/envs/py36/lib/python3.6/site-packages/six.py:693: KilledWorker
----------------------------- Captured stderr call -----------------------------
distributed.protocol.core - CRITICAL - Failed to deserialize
Traceback (most recent call last):
  File "/home/shujie/anaconda3/envs/py36/lib/python3.6/site-packages/MDAnalysis-0.19.1.dev0-py3.6-linux-x86_64.egg/MDAnalysis/core/groups.py", line 119, in _unpickle
    u = _ANCHOR_UNIVERSES[uhash]
  File "/home/shujie/anaconda3/envs/py36/lib/python3.6/weakref.py", line 137, in __getitem__
    o = self.data[key]()
KeyError: UUID('38eaec10-83dd-4648-aca7-5e16ce9f7196')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/shujie/anaconda3/envs/py36/lib/python3.6/site-packages/distributed/protocol/core.py", line 131, in loads
    value = _deserialize(head, fs, deserializers=deserializers)
  File "/home/shujie/anaconda3/envs/py36/lib/python3.6/site-packages/distributed/protocol/serialize.py", line 178, in deserialize
    return loads(header, frames)
  File "/home/shujie/anaconda3/envs/py36/lib/python3.6/site-packages/distributed/protocol/serialize.py", line 57, in pickle_loads
    return pickle.loads(b''.join(frames))
  File "/home/shujie/anaconda3/envs/py36/lib/python3.6/site-packages/distributed/protocol/pickle.py", line 59, in loads
    return pickle.loads(x)
  File "/home/shujie/anaconda3/envs/py36/lib/python3.6/site-packages/MDAnalysis-0.19.1.dev0-py3.6-linux-x86_64.egg/MDAnalysis/core/groups.py", line 127, in _unpickle
    for k in _ANCHOR_UNIVERSES.keys()])))
RuntimeError: Couldn't find a suitable Universe to unpickle AtomGroup onto with Universe hash '38eaec10-83dd-4648-aca7-5e16ce9f7196'.  Available hashes: 
distributed.core - ERROR - Couldn't find a suitable Universe to unpickle AtomGroup onto with Universe hash '38eaec10-83dd-4648-aca7-5e16ce9f7196'.  Available hashes: 
Traceback (most recent call last):
  File "/home/shujie/anaconda3/envs/py36/lib/python3.6/site-packages/MDAnalysis-0.19.1.dev0-py3.6-linux-x86_64.egg/MDAnalysis/core/groups.py", line 119, in _unpickle
    u = _ANCHOR_UNIVERSES[uhash]
  File "/home/shujie/anaconda3/envs/py36/lib/python3.6/weakref.py", line 137, in __getitem__
    o = self.data[key]()
KeyError: UUID('38eaec10-83dd-4648-aca7-5e16ce9f7196')

As mentioned in issue #76, the current solution is to avoid setting an AtomGroup or Universe as a self attribute.

version of dask

dask 0.19.4

parallel dihedral featurizer

Parallelize obtaining a timeseries of dihedral angles, e.g.

  • backbone angles (phi, psi, omega)
  • user-defined angles

This will be useful for Ramachandran maps, dihedral PCA, and other dimensionality reduction methods.
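
As an interim workaround (a sketch under assumptions; the file names and the selection are placeholders), a single user-defined dihedral can already be collected in parallel through pmda.custom:

    import MDAnalysis as mda
    import pmda.custom

    u = mda.Universe("topol.tpr", "traj.xtc")    # placeholder files
    # four atoms defining one dihedral; note that the selection is ordered
    # by atom index, which must match the intended angle definition
    group = u.select_atoms("resid 10 and name N CA CB CG")

    def dihedral(ag):
        return ag.dihedral.value()               # angle in degrees

    angles = pmda.custom.AnalysisFromFunction(dihedral, u, group)
    angles.run(n_jobs=4)
    print(angles.results)                        # one angle per frame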

A parallel version of InterRDF_s

The current version of pmda.rdf does not contain a parallel version of MDAnalysis.analysis.rdf.InterRDF_s, which is a tool to calculate site-specific radial distribution functions. We need to add this to pmda.rdf.

LeafletFinder issues

A quick issue collecting various things that came up during PR #66; apologies for the messy report. See PR #81 for initial (failing) tests.

n_jobs

LeafletFinder with n_jobs == 2 does not pass tests, see #66 (comment)

  • Currently XFAIL in #66 and master

distributed

LeafletFinder fails when the scheduler is a distributed.Client; see also the started PR #81.

______________________________________________ TestLeafLet.test_leaflet_single_frame[distributed-2-1] _______________________________________________

self = <test_leaflet.TestLeafLet object at 0xd26ea5fd0>, u_one_frame = <Universe with 5040 atoms>
correct_values_single_frame = [array([   1,   13,   25,   37,   49,   61,   73,   85,   97,  109,  121,
        133,  145,  157,  169,  181,  193,  ..., 4477, 4489,
       4501, 4513, 4525, 4537, 4549, 4561, 4573, 4585, 4597, 4609, 4621,
       4633, 4645, 4657, 4669])]
n_jobs = 1, scheduler = <Client: scheduler='tcp://127.0.0.1:56156' processes=2 cores=4>

    @pytest.mark.parametrize('n_jobs', (-1, 1, 2))
    def test_leaflet_single_frame(self,
                                  u_one_frame,
                                  correct_values_single_frame,
                                  n_jobs,
                                  scheduler):
        lipid_heads = u_one_frame.select_atoms("name PO4")
        u_one_frame.trajectory.rewind()
        leaflets = leaflet.LeafletFinder(u_one_frame,
                                         lipid_heads).run(start=0, stop=1,
                                                          n_jobs=n_jobs,
>                                                         scheduler=scheduler)

/Volumes/Data/oliver/Biop/Projects/Methods/MDAnalysis/pmda/pmda/test/test_leaflet.py:67:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/Volumes/Data/oliver/Biop/Projects/Methods/MDAnalysis/pmda/pmda/leaflet.py:295: in run
    cutoff=cutoff)
/Volumes/Data/oliver/Biop/Projects/Methods/MDAnalysis/pmda/pmda/leaflet.py:205: in _single_frame
    Components = parAtomsMap.compute(**scheduler_kwargs)
/Users/oliver/anaconda3/envs/pmda/lib/python3.6/site-packages/dask/base.py:155: in compute
    (result,) = compute(self, traverse=False, **kwargs)
/Users/oliver/anaconda3/envs/pmda/lib/python3.6/site-packages/dask/base.py:392: in compute
    results = schedule(dsk, keys, **kwargs)
/Users/oliver/anaconda3/envs/pmda/lib/python3.6/site-packages/distributed/client.py:2308: in get
    direct=direct)
/Users/oliver/anaconda3/envs/pmda/lib/python3.6/site-packages/distributed/client.py:1647: in gather
    asynchronous=asynchronous)
/Users/oliver/anaconda3/envs/pmda/lib/python3.6/site-packages/distributed/client.py:665: in sync
    return sync(self.loop, func, *args, **kwargs)
/Users/oliver/anaconda3/envs/pmda/lib/python3.6/site-packages/distributed/utils.py:277: in sync
    six.reraise(*error[0])
/Users/oliver/anaconda3/envs/pmda/lib/python3.6/site-packages/six.py:693: in reraise
    raise value
/Users/oliver/anaconda3/envs/pmda/lib/python3.6/site-packages/distributed/utils.py:262: in f
    result[0] = yield future
/Users/oliver/anaconda3/envs/pmda/lib/python3.6/site-packages/tornado/gen.py:1099: in run
    value = future.result()
/Users/oliver/anaconda3/envs/pmda/lib/python3.6/site-packages/tornado/gen.py:1107: in run
    yielded = self.gen.throw(*exc_info)
/Users/oliver/anaconda3/envs/pmda/lib/python3.6/site-packages/distributed/client.py:1492: in _gather
    traceback)
/Users/oliver/anaconda3/envs/pmda/lib/python3.6/site-packages/six.py:692: in reraise
    raise value.with_traceback(tb)
/Users/oliver/anaconda3/envs/pmda/lib/python3.6/site-packages/distributed/protocol/pickle.py:59: in loads
    return pickle.loads(x)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   for k in _ANCHOR_UNIVERSES.keys()])))
E   RuntimeError: Couldn't find a suitable Universe to unpickle AtomGroup onto with Universe hash 'f065a285-b5d1-44db-a2e9-c1de8b73c716'.  Available hashes:

/Users/oliver/anaconda3/envs/pmda/lib/python3.6/site-packages/MDAnalysis/core/groups.py:127: RuntimeError
--------------------------------------------------------------- Captured stderr call ----------------------------------------------------------------
distributed.worker - WARNING - Could not deserialize task
Traceback (most recent call last):
  File "/Users/oliver/anaconda3/envs/pmda/lib/python3.6/site-packages/MDAnalysis/core/groups.py", line 119, in _unpickle
    u = _ANCHOR_UNIVERSES[uhash]
  File "/Users/oliver/anaconda3/envs/pmda/lib/python3.6/weakref.py", line 137, in __getitem__
    o = self.data[key]()
KeyError: UUID('f065a285-b5d1-44db-a2e9-c1de8b73c716')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/oliver/anaconda3/envs/pmda/lib/python3.6/site-packages/distributed/worker.py", line 1387, in add_task
    self.tasks[key] = _deserialize(function, args, kwargs, task)
  File "/Users/oliver/anaconda3/envs/pmda/lib/python3.6/site-packages/distributed/worker.py", line 801, in _deserialize
    function = pickle.loads(function)
  File "/Users/oliver/anaconda3/envs/pmda/lib/python3.6/site-packages/distributed/protocol/pickle.py", line 59, in loads
    return pickle.loads(x)
  File "/Users/oliver/anaconda3/envs/pmda/lib/python3.6/site-packages/MDAnalysis/core/groups.py", line 127, in _unpickle
    for k in _ANCHOR_UNIVERSES.keys()])))
RuntimeError: Couldn't find a suitable Universe to unpickle AtomGroup onto with Universe hash 'f065a285-b5d1-44db-a2e9-c1de8b73c716'.  Available hashes:

(complete error message from pytest)

pip installation not working

Expected behaviour

Installation works.

Actual behaviour

ImportError: No module named 'pmda'

Code to reproduce the behaviour

conda create -n test --yes
source activate test
pip install pmda
python -c 'import pmda'

Allow different parallelization schemes/frameworks

Currently, pmda assumes that all problems can be solved with a map-reduce scheme that maps over the frames. This scales well for a lot of problems. If an analysis type requires a lot of work per frame, it might make sense to also parallelize the work on a single frame. This is not supported in the current design.
I would therefore suggest adding a new base class FrameMapReduceBase that is a direct copy of ParallelAnalysisBase and replaces the latter. The structure would then be:

class ParallelAnalysisBase:
    """Define the public API for *any* pmda class."""
    def run(self, get=None, scheduler=None, n_jobs=-1, *args, **kwargs):
        if get is not None:
            scheduler = get
        self._run(scheduler=scheduler, n_jobs=n_jobs, *args, **kwargs)
        return self

    def _run(self, scheduler=None, n_jobs=-1, *args, **kwargs):
        raise NotImplementedError

class FrameMapReduceBase(ParallelAnalysisBase):
    """Implement map-reduce and 'map' over frames."""
    def _run(self, scheduler=None, n_jobs=-1, *args, **kwargs):
        # current implementation
        ...

This would better document the parallelization scheme we use and be very explicit about which functions we expect for a user-facing API. Right now this would only be the run method, and it should return a reference to the instance. If an analysis uses multiple cores but a different scheme to generate the DAG, like #47, it would only need to inherit from ParallelAnalysisBase and would be free to implement its own DAG.

RDF tests fail

Expected behaviour

No changes were made to the RDF code in PMDA. Tests for RDF should pass, as they have done before.

Actual behaviour

Tests fail now.

Maybe something changed in upstream MDAnalysis?

Code to reproduce the behaviour

e.g. https://travis-ci.org/MDAnalysis/pmda/jobs/592141990

____________________ test_density[True-13275.775528444701] _____________________
u = <Universe with 43480 atoms>
sels = [[<AtomGroup with 1 atom>, <AtomGroup with 2 atoms>], [<AtomGroup with 2 atoms>, <AtomGroup with 2 atoms>]]
density = True, value = 13275.775528444701
    @pytest.mark.parametrize("density, value", [
        (True, 13275.775528444701),
        (False, 0.021915460340071267)])
    def test_density(u, sels, density, value):
        rdf = InterRDF_s(u, sels, density=density).run()
>       assert_almost_equal(max(rdf.rdf[0][0][0]), value)
E       AssertionError: 
E       Arrays are not almost equal to 7 decimals
E        ACTUAL: 13275.775440503656
E        DESIRED: 13275.775528444701
/home/travis/build/MDAnalysis/pmda/pmda/test/test_rdf_s.py:117: AssertionError

Current version of MDAnalysis:

current master

default vstack in _conclude() fails with scalar per frame data

Expected behaviour

I can just follow the typical example to build an analysis from a single-frame function, such as

    import MDAnalysis as mda
    import MDAnalysisData
    data = MDAnalysisData.datasets.fetch_nhaa_equilibrium()

    u = mda.Universe(data.topology, data.trajectory)
    protein = u.select_atoms('protein')

    def rgyr(ag):
        return ag.radius_of_gyration()


    import pmda.custom
    parallel_rgyr = pmda.custom.AnalysisFromFunction(rgyr, u, protein)

    parallel_rgyr.run(n_jobs=4, n_blocks=4)
    print(parallel_rgyr.results)

Actual behaviour

The above fails with

--------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-30-27c1019c5274> in <module>
----> 1 parallel_rgyr.run(n_blocks=4)

~/MDAnalysis/pmda/pmda/parallel.py in run(self, start, stop, step, n_jobs, n_blocks)
    374             with timeit() as conclude:
    375                 self._results = np.asarray([el[0] for el in res])
--> 376                 self._conclude()
    377 
    378         self.timing = Timing(

~/MDAnalysis/pmda/pmda/custom.py in _conclude(self)
    101 
    102     def _conclude(self):
--> 103         self.results = np.vstack(self._results)
    104 
    105 

~/anaconda3/envs/pmda/lib/python3.6/site-packages/numpy/core/shape_base.py in vstack(tup)
    281     """
    282     _warn_for_nonsequence(tup)
--> 283     return _nx.concatenate([atleast_2d(_m) for _m in tup], 0)
    284 
    285 

ValueError: all the input array dimensions except for the concatenation axis must match exactly

Code to reproduce the behaviour

See above.

The problem is that the arrays in _results do not all have the same length:

>>> [len(a) for a in parallel_rgyr._results]
[1251, 1250, 1250, 1250]

and np.vstack() does not like this. Instead, one would need np.hstack(). However, as soon as the return value of the function is a tuple, vstack() works as expected.
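
A minimal sketch of a _conclude() that would cover both cases (not necessarily how pmda ends up fixing it): np.concatenate along axis 0 always joins on the frame axis, whether each frame yields a scalar or a tuple.

    import numpy as np

    def _conclude(self):
        # self._results is a list of per-block arrays whose shapes differ only
        # in the first (frame) dimension, e.g. (1251,) vs (1250,) for scalars
        # or (1251, k) vs (1250, k) for tuples
        self.results = np.concatenate(self._results, axis=0)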

Current version of MDAnalysis:

  • PMDA 0.2.1
  • MDAnalysis 0.19.2
  • dask 1.2.2

allow using updating atomgroups

It is currently not possible to use updating AtomGroups because we use the indices of an AtomGroup to rebuild it for each worker. We do not even show a warning right now that updating AtomGroups are not supported.

Convenient test for different schedulers

pmda can be used with different dask schedulers and in a single-core variant. It would be nice to have a pytest fixture that is easy to use and runs a single test with all supported schedulers.

from pmda.testing import scheduler

def test_custom(scheduler):
    a = AnalysisFromFunction(...).run(scheduler=scheduler)

I think it should be possible to parametrize over a fixture.
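
A minimal sketch of what such a fixture could look like (the distributed case is left out because it needs a running client fixture; the conftest.py snippet quoted in the "check scheduler names" issue above shows the fuller version):

    import dask
    import pytest

    @pytest.fixture(params=('multiprocessing', 'single-threaded'))
    def scheduler(request):
        # set the dask scheduler for the duration of a single test
        with dask.config.set(scheduler=request.param):
            yield request.param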

add timing log

It will be useful to know the time taken for

  • running the compute task
  • trajectory I/O
  • total time

This can guide users in choosing optimal parallelization schemes.

replace unsupported `get` keyword with `scheduler`

Expected behaviour

No warnings are raised when using dask.

PMDA should work with the latest dask releases (≥ 0.18.0).

Actual behaviour

Using the get= kwarg was deprecated and has now been removed (it raises a TypeError); use scheduler= instead:

pmda/test/test_contacts.py::TestContacts::()::test_startframe
  /home/travis/miniconda/envs/test/lib/python3.6/site-packages/dask/base.py:835: UserWarning: The get= keyword has been deprecated. Please use the scheduler= keyword instead with the name of the desired scheduler like 'threads' or 'processes'
    warnings.warn("The get= keyword has been deprecated. "

This is now a TypeError.

Code to reproduce the behaviour

Current version of MDAnalysis:

(run python -c "import MDAnalysis as mda; print(mda.__version__)")
(run python -c "import pmda; print(pmda.__version__)")
(run python -c "import dask; print(dask.__version__)")

long description does not render in reST on pypi

https://pypi.org/project/pmda/ does not render nicely in the 0.1.0 release. The content is properly declared but checking the README rendering with readme_renderer

python setup.py check -r -s
/nfs/homes2/oliver/Library/miniconda2/envs/mda3/lib/python3.6/distutils/dist.py:261: UserWarning: Unknown distribution option: 'long_description_content_type'
  warnings.warn(msg)
running check
warning: Check: missing meta-data: if 'author' supplied, 'author_email' must be supplied too

warning: Check: The project's long_description has invalid markup which will not be rendered on PyPI. The following syntax errors were detected:
line 40: Error: Unknown interpreted text role "class".


error: Please correct your package.

showed that we are using Sphinx constructs that the PyPI renderer does not like.

(The reason for these constructs is that we are using the README also as the landing page for the PMDA docs.)

Move RMSF into pmda.rms

Need to create an rms module as in MDAnalysis.analysis.rms for RMSF. Instead of having pmda.rmsf all in one file, we will make a submodule:

rms/__init__.py
rmsd.py
rmsf.py

and in __init__.py we will have:

from __future__ import absolute_import
from .rmsf import RMSF
from .rmsd import RMSD

__all__ = ["RMSD", "RMSF"]

This will require doc restructuring.

LeafletFinder tests started failing with "MarkDecorator"

Expected behaviour

Tests pass as before.

Actual behaviour

Tests started failing in LeafletFinder with f82e4c6, which had nothing to do with this part of the code.

E       TypeError: unsupported operand type(s) for /: 'int' and 'MarkDecorator'

Full log snippet

See https://travis-ci.org/MDAnalysis/pmda/jobs/529478197

================================== FAILURES ===================================
______________________ TestLeafLet.test_leaflet[n_jobs0] _______________________
self = <test_leaflet.TestLeafLet object at 0x7f0522a67080>
universe = <Universe with 43480 atoms>
correct_values = [array([36507, 36761, 37523, 37650, 38031, 38285]), array([36634]), array([36507, 36761, 38285, 39174]), array([36634]), array([36507, 36761, 37650, 38285, 39174, 39936]), array([36634]), ...]
n_jobs = MarkDecorator(mark=Mark(name='xfail', args=(-1,), kwargs={}))
    @pytest.mark.parametrize('n_jobs', (pytest.mark.xfail(-1),
                                        1,
                                        pytest.mark.xfail(2)))
    def test_leaflet(self, universe, correct_values, n_jobs):
        lipid_heads = universe.select_atoms("name P and resname POPG")
        universe.trajectory.rewind()
        leaflets = leaflet.LeafletFinder(universe, lipid_heads)
>       leaflets.run(n_jobs=n_jobs)
/home/travis/build/MDAnalysis/pmda/pmda/test/test_leaflet.py:49: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/home/travis/build/MDAnalysis/pmda/pmda/leaflet.py:301: in run
    cutoff=cutoff)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
self = <pmda.leaflet.LeafletFinder object at 0x7f0522075e48>
ts = < Timestep 0 with unit cell dimensions [102.844894 102.84479  132.1866    90.        90.       120.      ] >
atomgroups = <AtomGroup with 55 atoms>
scheduler_kwargs = {'scheduler': 'single-threaded'}
n_jobs = MarkDecorator(mark=Mark(name='xfail', args=(-1,), kwargs={}))
cutoff = 15.0
    def _single_frame(self, ts, atomgroups, scheduler_kwargs, n_jobs,
                      cutoff=15.0):
        """Perform computation on a single trajectory frame.
    
        Must return computed values as a list. You can only **read**
        from member variables stored in ``self``. Changing them during
        a run will result in undefined behavior. `ts` and any of the
        atomgroups can be changed (but changes will be overwritten
        when the next time step is read).
    
        Parameters
        ----------
        scheduler_kwargs : Dask Scheduler parameters.
        cutoff : float (optional)
            head group-defining atoms within a distance of `cutoff`
            Angstroms are deemed to be in the same leaflet [15.0]
    
        Returns
        -------
        values : anything
            The output from the computation over a single frame must
            be returned. The `value` will be added to a list for each
            block and the list of blocks is stored as :attr:`_results`
            before :meth:`_conclude` is run. In order to simplify
            processing, the `values` should be "simple" shallow data
            structures such as arrays or lists of numbers.
    
        """
    
        # Get positions of the atoms in the atomgroup and find their number.
        atoms = ts.positions[atomgroups.indices]
        matrix_size = atoms.shape[0]
        arranged_coord = list()
>       part_size = int(matrix_size / n_jobs)
E       TypeError: unsupported operand type(s) for /: 'int' and 'MarkDecorator'
/home/travis/build/MDAnalysis/pmda/pmda/leaflet.py:192: TypeError
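
For context, newer pytest versions no longer accept pytest.mark.xfail(...) directly as a parametrize value, which is why the MarkDecorator object itself is passed through as n_jobs. A sketch of the modern idiom (not necessarily the exact change applied in pmda):

    import pytest

    @pytest.mark.parametrize('n_jobs', (
        pytest.param(-1, marks=pytest.mark.xfail),
        1,
        pytest.param(2, marks=pytest.mark.xfail),
    ))
    def test_leaflet(n_jobs):
        # stand-in body; the real test builds a LeafletFinder and runs it
        assert n_jobs in (-1, 1, 2)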

release 0.2.1

We can create a 0.2.1 release (for the SciPy manuscript).

We only added an internal additional timer since 0.2.0.

  • tagged
  • GitHub release
  • PyPi release
  • conda-forge release

fails with ChainReader trajectory

Expected behaviour

pmda can deal with any universe/trajectory that can be built with MDAnalysis

Actual behaviour

Fails with "TypeError: can't pickle generator objects" (see stack trace below) when universe.trajectory is a ChainReader.

Code to reproduce the behaviour

import MDAnalysis as mda
from MDAnalysis.tests.datafiles import TPR, XTC

import pmda.rms

u = mda.Universe(TPR, 3*[XTC])
print(u.trajectory)
# <ChainReader ['adk_oplsaa.xtc', 'adk_oplsaa.xtc', 'adk_oplsaa.xtc'] with 30 frames of 47681 atoms>

protein = u.select_atoms("protein and not name H*")
R = pmda.rms.RMSD(protein, protein).run(n_jobs=4)

Current version of MDAnalysis:

(run python -c "import MDAnalysis as mda; print(mda.__version__)")
(run python -c "import pmda; print(pmda.__version__)")
(run python -c "import dask; print(dask.__version__)")

  • MDAnalysis: 0.18.1-dev
  • pmda: 0.1.1
  • dask: 0.17.5

Stack trace

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-aced2a5b0b3e> in <module>()
      1 protein = u.select_atoms("protein and not name H*")
----> 2 R = pmda.rms.RMSD(protein, protein).run(n_jobs=4)

~/Library/miniconda2/envs/mda3/lib/python3.6/site-packages/pmda/parallel.py in run(self, start, stop, step, scheduler, n_jobs, n_blocks)
    257                 blocks.append(task)
    258             blocks = delayed(blocks)
--> 259             res = blocks.compute(**scheduler_kwargs)
    260             self._results = np.asarray([el[0] for el in res])
    261             self._conclude()

~/Library/miniconda2/envs/mda3/lib/python3.6/site-packages/dask/base.py in compute(self, **kwargs)
    152         dask.base.compute
    153         """
--> 154         (result,) = compute(self, traverse=False, **kwargs)
    155         return result
    156 

~/Library/miniconda2/envs/mda3/lib/python3.6/site-packages/dask/base.py in compute(*args, **kwargs)
    405     keys = [x.__dask_keys__() for x in collections]
    406     postcomputes = [x.__dask_postcompute__() for x in collections]
--> 407     results = get(dsk, keys, **kwargs)
    408     return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
    409 

~/Library/miniconda2/envs/mda3/lib/python3.6/site-packages/dask/multiprocessing.py in get(dsk, keys, num_workers, func_loads, func_dumps, optimize_graph, **kwargs)
    175                            get_id=_process_get_id, dumps=dumps, loads=loads,
    176                            pack_exception=pack_exception,
--> 177                            raise_exception=reraise, **kwargs)
    178     finally:
    179         if cleanup:

~/Library/miniconda2/envs/mda3/lib/python3.6/site-packages/dask/local.py in get_async(apply_async, num_workers, dsk, result, cache, get_id, rerun_exceptions_locally, pack_exception, raise_exception, callbacks, dumps, loads, **kwargs)
    506             # Seed initial tasks into the thread pool
    507             while state['ready'] and len(state['running']) < num_workers:
--> 508                 fire_task()
    509 
    510             # Main loop, wait on tasks to finish, insert new ones

~/Library/miniconda2/envs/mda3/lib/python3.6/site-packages/dask/local.py in fire_task()
    500                 # Submit
    501                 apply_async(execute_task,
--> 502                             args=(key, dumps((dsk[key], data)),
    503                                   dumps, loads, get_id, pack_exception),
    504                             callback=queue.put)

~/Library/miniconda2/envs/mda3/lib/python3.6/site-packages/dask/multiprocessing.py in _dumps(x)
     28 
     29 def _dumps(x):
---> 30     return cloudpickle.dumps(x, protocol=pickle.HIGHEST_PROTOCOL)
     31 
     32 

~/Library/miniconda2/envs/mda3/lib/python3.6/site-packages/cloudpickle/cloudpickle.py in dumps(obj, protocol)
    893     try:
    894         cp = CloudPickler(file, protocol=protocol)
--> 895         cp.dump(obj)
    896         return file.getvalue()
    897     finally:

~/Library/miniconda2/envs/mda3/lib/python3.6/site-packages/cloudpickle/cloudpickle.py in dump(self, obj)
    266         self.inject_addons()
    267         try:
--> 268             return Pickler.dump(self, obj)
    269         except RuntimeError as e:
    270             if 'recursion' in e.args[0]:

~/Library/miniconda2/envs/mda3/lib/python3.6/pickle.py in dump(self, obj)
    407         if self.proto >= 4:
    408             self.framer.start_framing()
--> 409         self.save(obj)
    410         self.write(STOP)
    411         self.framer.end_framing()

~/Library/miniconda2/envs/mda3/lib/python3.6/pickle.py in save(self, obj, save_persistent_id)
    474         f = self.dispatch.get(t)
    475         if f is not None:
--> 476             f(self, obj) # Call unbound method with explicit self
    477             return
    478 

~/Library/miniconda2/envs/mda3/lib/python3.6/pickle.py in save_tuple(self, obj)
    734         if n <= 3 and self.proto >= 2:
    735             for element in obj:
--> 736                 save(element)
    737             # Subtle.  Same as in the big comment below.
    738             if id(obj) in memo:

~/Library/miniconda2/envs/mda3/lib/python3.6/pickle.py in save(self, obj, save_persistent_id)
    474         f = self.dispatch.get(t)
    475         if f is not None:
--> 476             f(self, obj) # Call unbound method with explicit self
    477             return
    478 

~/Library/miniconda2/envs/mda3/lib/python3.6/pickle.py in save_tuple(self, obj)
    749         write(MARK)
    750         for element in obj:
--> 751             save(element)
    752 
    753         if id(obj) in memo:

~/Library/miniconda2/envs/mda3/lib/python3.6/pickle.py in save(self, obj, save_persistent_id)
    474         f = self.dispatch.get(t)
    475         if f is not None:
--> 476             f(self, obj) # Call unbound method with explicit self
    477             return
    478 

~/Library/miniconda2/envs/mda3/lib/python3.6/site-packages/cloudpickle/cloudpickle.py in save_instancemethod(self, obj)
    665         else:
    666             if PY3:
--> 667                 self.save_reduce(types.MethodType, (obj.__func__, obj.__self__), obj=obj)
    668             else:
    669                 self.save_reduce(types.MethodType, (obj.__func__, obj.__self__, obj.__self__.__class__),

~/Library/miniconda2/envs/mda3/lib/python3.6/pickle.py in save_reduce(self, func, args, state, listitems, dictitems, obj)
    608         else:
    609             save(func)
--> 610             save(args)
    611             write(REDUCE)
    612 

~/Library/miniconda2/envs/mda3/lib/python3.6/pickle.py in save(self, obj, save_persistent_id)
    474         f = self.dispatch.get(t)
    475         if f is not None:
--> 476             f(self, obj) # Call unbound method with explicit self
    477             return
    478 

~/Library/miniconda2/envs/mda3/lib/python3.6/pickle.py in save_tuple(self, obj)
    734         if n <= 3 and self.proto >= 2:
    735             for element in obj:
--> 736                 save(element)
    737             # Subtle.  Same as in the big comment below.
    738             if id(obj) in memo:

~/Library/miniconda2/envs/mda3/lib/python3.6/pickle.py in save(self, obj, save_persistent_id)
    519 
    520         # Save the reduce() output and finally memoize the object
--> 521         self.save_reduce(obj=obj, *rv)
    522 
    523     def persistent_id(self, obj):

~/Library/miniconda2/envs/mda3/lib/python3.6/pickle.py in save_reduce(self, func, args, state, listitems, dictitems, obj)
    632 
    633         if state is not None:
--> 634             save(state)
    635             write(BUILD)
    636 

~/Library/miniconda2/envs/mda3/lib/python3.6/pickle.py in save(self, obj, save_persistent_id)
    474         f = self.dispatch.get(t)
    475         if f is not None:
--> 476             f(self, obj) # Call unbound method with explicit self
    477             return
    478 

~/Library/miniconda2/envs/mda3/lib/python3.6/pickle.py in save_dict(self, obj)
    819 
    820         self.memoize(obj)
--> 821         self._batch_setitems(obj.items())
    822 
    823     dispatch[dict] = save_dict

~/Library/miniconda2/envs/mda3/lib/python3.6/pickle.py in _batch_setitems(self, items)
    845                 for k, v in tmp:
    846                     save(k)
--> 847                     save(v)
    848                 write(SETITEMS)
    849             elif n:

~/Library/miniconda2/envs/mda3/lib/python3.6/pickle.py in save(self, obj, save_persistent_id)
    519 
    520         # Save the reduce() output and finally memoize the object
--> 521         self.save_reduce(obj=obj, *rv)
    522 
    523     def persistent_id(self, obj):

~/Library/miniconda2/envs/mda3/lib/python3.6/pickle.py in save_reduce(self, func, args, state, listitems, dictitems, obj)
    632 
    633         if state is not None:
--> 634             save(state)
    635             write(BUILD)
    636 

~/Library/miniconda2/envs/mda3/lib/python3.6/pickle.py in save(self, obj, save_persistent_id)
    474         f = self.dispatch.get(t)
    475         if f is not None:
--> 476             f(self, obj) # Call unbound method with explicit self
    477             return
    478 

~/Library/miniconda2/envs/mda3/lib/python3.6/pickle.py in save_dict(self, obj)
    819 
    820         self.memoize(obj)
--> 821         self._batch_setitems(obj.items())
    822 
    823     dispatch[dict] = save_dict

~/Library/miniconda2/envs/mda3/lib/python3.6/pickle.py in _batch_setitems(self, items)
    845                 for k, v in tmp:
    846                     save(k)
--> 847                     save(v)
    848                 write(SETITEMS)
    849             elif n:

~/Library/miniconda2/envs/mda3/lib/python3.6/pickle.py in save(self, obj, save_persistent_id)
    494             reduce = getattr(obj, "__reduce_ex__", None)
    495             if reduce is not None:
--> 496                 rv = reduce(self.proto)
    497             else:
    498                 reduce = getattr(obj, "__reduce__", None)

TypeError: can't pickle generator objects

updating AtomGroups are silently ignored and don't work

Expected behaviour

If I write a custom function with updating AtomGroups then PMDA should use it just as MDAnalysis does.

uag = u.select_atoms("resname SOD and around 3 (name OD*)", updating=True)
counter = pmda.custom.AnalysisFromFunction(lambda ag: ag.n_atoms, u, uag)
counter.run()

At a minimum, it should fail when handed an updating AG.
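
A minimal sketch of the requested failure mode (hypothetical check, not current pmda code):

    from MDAnalysis.core.groups import UpdatingAtomGroup

    def _reject_updating(atomgroups):
        # pmda only passes static indices to the workers, so an
        # UpdatingAtomGroup silently degrades to its initial selection
        for ag in atomgroups:
            if isinstance(ag, UpdatingAtomGroup):
                raise TypeError(
                    "UpdatingAtomGroups are not supported by pmda")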

Actual behaviour

Only the initial selection is used (because PMDA communicates indices, not AtomGroups).

Code to reproduce the behaviour

see above

Current version of MDAnalysis:

MDA 0.19.0, pmda 1.1+dev

online documentation

You can copy the files from pytng or RotamerConvolve to get the simple standard layout and color scheme. (Perhaps we should have a repo just for the basic doc layout for easier cloning.)

It should appear as https://www.mdanalysis.org/pmda

store block information

Similar to #86: store information to get the frames in each block, e.g., store the slices from make_balanced_blocks() in the class. Have a managed attribute _blocks where the slices are applied to all trajectory frame indices when someone wants them.

Example of what the user should get back: list of arrays:

_blocks = [
        [0, 1, 2, 3],       # block 0
        [4, 5, 6, 7],       # block 1
        [8, 9, 10],         # block 2
]
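
A minimal sketch of how the managed attribute could expose the stored slices as explicit frame indices (hypothetical helper name; only the idea is taken from the proposal above):

    import numpy as np

    def blocks_from_slices(slices, frame_indices):
        # apply each block's slice to the full list of trajectory frame indices
        frame_indices = np.asarray(frame_indices)
        return [frame_indices[s] for s in slices]

    # e.g. 11 frames in 3 blocks:
    # blocks_from_slices([slice(0, 4), slice(4, 8), slice(8, 11)], range(11))
    # -> [array([0, 1, 2, 3]), array([4, 5, 6, 7]), array([8, 9, 10])]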

Add IO and computation time per block in timing

Expected behaviour

I/O and computation time per block should be stored in the timing object so that one can easily know the components of the total time used for each block.

Actual behaviour

Only the I/O and computation time per frame are saved in the timing object.

Current version of MDAnalysis:

MDAnalysis 0.19.2
PMDA 0.2.1
dask 1.2.2

DensityAnalysis Normalization

The current density function may have an issue with normalization. Since DensityAnalysis uses the MDAnalysis density function, it is currently unclear whether MDAnalysis.density.density_from_universe() is incorrect or whether its implementation in pmda is incorrect. We need to compare the results of MDAnalysis.analysis.density.density_from_universe() and pmda.density.DensityAnalysis for other trajectories to see whether they agree, and also find a third-party method of calculating densities to compare values against.

do not use dask if no scheduler is given and `n_jobs=1`

Currently we use dask even for a serial job. This can make debugging quite hard since it breaks the Python debugger's expectations. If we set things up so that serial jobs run without dask, it will be easier for people to develop new algorithms and debug them with the Python debugger.
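
A minimal sketch of the idea (not pmda's implementation; tasks is assumed to be a list of zero-argument callables, one per trajectory block):

    import dask
    from dask import delayed

    def compute_blocks(tasks, n_jobs=1, scheduler=None):
        if n_jobs == 1 and scheduler is None:
            # plain Python loop: no dask graph, so pdb and breakpoints
            # behave as expected
            return [task() for task in tasks]
        jobs = [delayed(task)() for task in tasks]
        return list(dask.compute(*jobs, scheduler=scheduler))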

Fail to assign trajectory to the last block

Expected behaviour

Each block should be assigned trajectory frames.

Actual behaviour

No frames are assigned to the last block when the total number of frames meets a certain condition.

Code to reproduce the behaviour

from MDAnalysisTests.datafiles import GRO_MEMPROT, XTC_MEMPROT
import MDAnalysis as mda
from pmda import rms

u = mda.Universe(GRO_MEMPROT, XTC_MEMPROT)
protein = u.select_atoms('protein')
rmsd = rms.RMSD(protein, protein)
rmsd.run(n_jobs=n)

Where XTC_MEMPROT is a trajectory with 5 frames.
It works for n = 1, 2, 3.
For n = 4, it gives an error:

ValueError: all the input array dimensions except for the concatenation axis must match exactly

This is caused by the empty list returned from the last block.

To explain why the result from the last block is empty, we should first look at parallel.py.
The length of the trajectory assigned to each block is determined by
bsize = int(np.ceil(n_frames / float(n_blocks))).

If n_frames=5 and n_blocks=4, then bsize=2, and the four blocks are assigned 2, 2, 1, and 0 frames. No frame is assigned to the last one.

This issue happens when (a-1)*n_blocks < n_frames <= a*(n_blocks-1)
for some positive integer a < n_blocks (where a = bsize).
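
A quick demonstration of the arithmetic, together with one balanced alternative (np.array_split, which never produces an empty block):

    import numpy as np

    n_frames, n_blocks = 5, 4
    bsize = int(np.ceil(n_frames / float(n_blocks)))              # bsize = a = 2

    naive = [list(range(i * bsize, min((i + 1) * bsize, n_frames)))
             for i in range(n_blocks)]
    print([len(b) for b in naive])                                # [2, 2, 1, 0]

    balanced = np.array_split(np.arange(n_frames), n_blocks)
    print([len(b) for b in balanced])                             # [2, 1, 1, 1]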
