
adaptivemd's Introduction


AdaptiveMD

A Python framework to run adaptive MD simulations using Markov State Model (MSM) analysis on HPC resources.

  • See below for a simple installation
  • Configure & run install_admd.sh to deploy on HPC or cluster

The generation of MSMs requires a huge amount of trajectory data to be analyzed. In most cases this leads to an enhanced understanding of the dynamics of the system, which can be used to make decisions about collecting more data to achieve a desired accuracy or level of detail in the generated MSM. This alternating process between simulation, to actively generate new observations, and analysis is currently cumbersome and involves human decisions along the way.

AdaptiveMD aims to simplify this process with the following design goals:

  1. Ease of use: simple system setup once an HPC resource has been added.
  2. Flexibility: modular setup of multiple HPC resources and different simulation engines.
  3. Automation: a user-defined adaptive strategy is executed without manual intervention.
  4. Compatibility: build analysis tools and export to known formats.

After installation, you might want to start working with the examples in examples/tutorials. You will first learn the basics of how tasks are created and executed, and then more on composing workflows.

Prerequisites

There are a few components we need to install for AdaptiveMD to work. If you are installing in a regular workstation environment, you can follow the instructions below or use the installer script. If you are installing in a cluster or HPC environment, we recommend you use the script install_admd.sh. The instructions here in the README should give you the gist of what's going on in the installer, which sets up the same components but does more configuration of the environment used by AdaptiveMD.

AdaptiveMD creates task descriptions that can be executed using a native worker object or via the RADICAL-Pilot execution framework.

MongoDB

AdaptiveMD needs access to a MongoDB instance. If you want to store project data locally, you need to install MongoDB. Both your user machine and the compute resource (where tasks are executed) must be able to reach the port used by the database.

MongoDB Community Edition will provide an installer for your OS, just download and follow the installation instructions. Depending on the compute resource network restrictions, it might be necessary to install the database in different locations for production workflows.

For linux systems:

curl -O https://fastdl.mongodb.org/linux/mongodb-linux-x86_64-debian81-3.4.2.tgz
tar -zxvf mongodb-linux-x86_64-debian81-3.4.2.tgz
mkdir ~/mongodb
mv mongodb-linux-x86_64-debian81-3.4.2/ ~/mongodb
# add mongodb binaries to PATH in .bashrc
echo "export PATH=~/mongodb/mongodb-linux-x86_64-debian81-3.4.2/bin/:\$PATH" >> ~/.bashrc
# create parent directory for database
mkdir -p ~/mongodb/data/db
# run a `mongod` daemon in the background
mongod --quiet --dbpath ~/mongodb/data/db &
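Both machines must be able to reach this port (27017 by default). As a quick sanity check, a small stdlib-only helper can verify that the port is open; this is just a sketch, and the remote host name below is a placeholder:

```python
import socket

def mongo_reachable(host="localhost", port=27017, timeout=2.0):
    """Return True if a TCP connection to the given host/port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# run this from both the workstation and a compute node, e.g.:
#   mongo_reachable("my-workstation.example.org")   # hypothetical host name
```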

Conda

We recommend using contained Python environments to run AdaptiveMD. The AdaptiveMD application environment can be considered separate from the task-execution environment, but for simplicity in these instructions we will load the same environment in tasks that is used when running the application. This means that AdaptiveMD, along with the task 'kernels' OpenMM and PyEMMA, is installed in a single environment.

Conda provides a complete and self-contained python installation that prevents all sorts of problems with software installation & compatibility, so we recommend it to get started. When you move to a cluster or HPC environment, it will likely be a better choice to use virtualenv as your environment container. It uses an existing python installation with env-specific libraries, and is thus faster to load and more scalable, but does not resolve installation issues as thoroughly.

If you do not yet have conda installed, do so using:

curl -O https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

Add 2 useful channels

conda config --append channels conda-forge
conda config --append channels omnia

--append makes sure that the default conda channels are tried first, with conda-forge and omnia used as fallbacks.

Install required packages now:

# be sure the conda python version is fully updated
conda install python
# create a new conda environment for all the installations
conda create -n admdenv python=2.7
source activate admdenv

Now install the AdaptiveMD-related packages. Note that you must be inside the Python environment you will be working in when installing the packages (or use: conda install -n admdenv [packages...]):

# jupyter notebook for tutorials and project work
conda install jupyter
# to prep for adaptivemd install
conda install pyyaml
# for simulations & analysis
# since we're using same env for tasks
conda install pyemma openmm

Install AdaptiveMD

Let's get adaptivemd from the github repo now.

# clone and install adaptivemd 
git clone https://github.com/markovmodel/adaptivemd.git

# go to adaptivemd and install it
cd adaptivemd/
python setup.py install
#OR
#pip install .
# see if we pass the import test
python -c "import adaptivemd" || echo 'FAILED'

# run a simple test
cd adaptivemd/tests/
#FIXME
python test_simple.py

pyemma and openmm should be installed on the compute resource as well as the local machine. It is possible to exclude, say, openmm from the local install if simulations will only be run on the resource.

That's it. Have fun running adaptive simulations.

Documentation

To compile the doc pages, clone this github repository, go into the docs folder and do

conda install sphinx sphinx_rtd_theme pandoc
make html

The HTML pages are in _build/html. Please note that the docs can only be compiled if all the above-mentioned AdaptiveMD dependencies are available. If you are using conda environments, this means that your AdaptiveMD environment should be active.

adaptivemd's People

Contributors

franknoe, itomaldonado, jhprinz, jro1234, marscher, thempel, vivek-bala


adaptivemd's Issues

MSM analysis worker

I just had the scenario that an analysis failed because of the DB file limits mentioned in #26 followed by a worker crash (walltime). After restarting everything, the worker printed

Queued a task [PythonTask] from generator `pyemma`
task did not complete

which apparently was caused by

print task.stderr.message
ln: failed to create symbolic link 'input.json': File exists
ln: failed to create symbolic link '_run_.py': File exists
ln: failed to create symbolic link 'input.pdb': File exists
ln: failed to create symbolic link '_file__rpc_output_748c032e-3ff1-4391-845b-a01cdea3796d.json': File exists
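The "File exists" failures suggest the worker re-creates symlinks that survived the earlier crash. A possible fix, sketched here (this is not the actual worker code), is to make link creation idempotent:

```python
import os

def link_if_missing(src, dst):
    """Create a symlink; tolerate the link already existing, e.g. after a
    worker crash/restart or a race between two workers."""
    try:
        os.symlink(src, dst)
    except OSError:
        # re-raise unless the destination really exists now
        if not os.path.lexists(dst):
            raise
```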

mongodb available in Anaconda distribution

I've just noticed that mongodb is now available for conda install. Should we update the install instructions accordingly for the sake of an easier installation process?
Note that the available version is 3.3.8; I don't know if that version is sufficient.

Support Gromacs

This is interesting for @nuria. Since she already has scripts to run gromacs, we could support these in adaptivemd. That should be a quick addition.

Bug with reloading a project?

After reloading a project, the project.trajectories object is empty. I'm not sure if I caused this myself by creating the engine and modeller objects again instead of loading them. Probably that's the problem, so maybe it would be good to have an error message or something. Since the files are still in place, it would also be nice to fix that somehow.

Edit: If I load all the objects like engine and modeller, trying to output the trajectories with list(project.trajectories) results in a TypeError.

Worker stdout handling

It would be great to have an option to see the actual worker stdout/stderr for debugging purposes.

Support AceMD

If someone is interested I can hack something together. Maybe sebastian.

Tutorial part 4

Refers to:
https://github.com/markovmodel/adaptivemd/blob/master/examples/tutorial/4_example_advanced_tasks.ipynb
But it also addresses general package/API issues along the way.

I'll have a number of questions / comments about the details of this notebook, but first of all I'm missing the overall picture that I think should be given here, so let me just start by a general question:

General:

  • My understanding of a task is that it is run by the worker, so it can access files from the "staging area" (which I think is just the project storage, correct?), copy or link them to the worker directory, and run things there (e.g. an MD trajectory or a PyEMMA analysis). Now the main task of the adaptive sampling package is to establish communication between the workers and the decision-maker (strategy or 'brain'). My understanding is that this is done via the database, i.e. workers need to write information (in the form of models, or some compressed trajectory statistics, or other metadata) into the database, while the strategy needs to retrieve this information from the database in order to decide what to do next. Information on how to do this is completely lacking so far. Can you write the notebooks - or alternatively the doc pages - with this in mind? I think then we can go more into details.
  • I really like the way you solved file transfer on the task level, this looks very powerful and general.

Details (incomplete):

  • There are still references to RP in this notebook. I guess these are out of date.
  • sometimes there are three slashes in the filenames (///); I'm not sure if this is intended and what it means
  • There's this project.trajectories.one, and I have no clue what it means.
  • Lines such as Transfer('file://{}/my_rpc_function.py' > 'worker://my_rpc_function.py). I'm not sure what this is - some meta-language that is parsed and translated into an actual bash command? I'm not sure why not just use bash at this stage, because other lines are just bash (like the echo example). Even in a meta-language, why use a > to denote copy direction, if you want to make it look like a function call, I'd rather use a comma separating the two arguments, otherwise it looks like a pipe.

PyEMMA not preserving input file order?

Does anyone know that? If I pass several files to pyemma.coordinates.source, will it preserve the order and return .lengths and dtrajs in the same order?

I have a case where it does not, and hence I cannot make sure that I assign the correct states to the input trajectories.

mongodb file size limitations

I just came across a pymongo.errors.DocumentTooLarge error on a very small dataset, probably because I accidentally produced a huge transition matrix. We might need to solve this problem sooner or later, especially because the discrete trajectories become larger and larger with time...
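For reference, MongoDB caps a single BSON document at 16 MB, and pymongo's GridFS works around this by splitting a payload across many chunk documents. The idea, as a minimal stdlib-only sketch (not the GridFS API itself):

```python
# MongoDB rejects single documents above the 16 MB BSON limit.
BSON_LIMIT = 16 * 1024 * 1024

def split_payload(data, chunk_size=BSON_LIMIT // 2):
    """Split a bytes payload into small chunk documents, GridFS-style."""
    return [
        {"seq": i, "data": data[off:off + chunk_size]}
        for i, off in enumerate(range(0, len(data), chunk_size))
    ]

def join_payload(chunks):
    """Reassemble the original payload from its chunk documents."""
    return b"".join(c["data"] for c in sorted(chunks, key=lambda c: c["seq"]))
```

In practice one would store each chunk as its own document (or just use GridFS) instead of one oversized matrix document.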

API change: staging for generators is also a simple task

I think that makes sense in our case, but not necessarily for RP compatibility.

What is the problem?

Currently a generator (openmmengine, etc.) often requires the same files for every run; openmm, for example, needs system.xml, a .pdb file, etc. Of course we do not create these files again every time, but store them once on the cluster and then use the staging_area, which is this location. So the staging_area holds files that we need to copy once and then access often, but that are not results.

RP allows you to run a special stage_in command to push these files to the cluster before you run tasks. But we run into the problem that every worker will copy these files to make sure that they exist. We have already hit problems where the same file is potentially created several times, requiring try-except clauses to make sure this works.

There are 2 choices to fix that.

  1. Forget about the staging area and just always copy the file from the DB (which is already possible)

  2. Use the staging_area, but treat populating it as a separate task on which the first task, and practically all others, depend. The first running worker picks the staging task and creates all staging files; once this is done, we can continue with the regular tasks. I still need to think about whether there are other problems, but this seems more consistent and simpler.
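Option 2 boils down to a one-time staging task on which every regular task depends, so only the first worker creates the staging files. A minimal sketch of that ordering (the class and function names here are hypothetical, not the adaptivemd API):

```python
class Task(object):
    """Hypothetical minimal task with explicit dependencies."""
    def __init__(self, name, depends_on=()):
        self.name = name
        self.depends_on = list(depends_on)
        self.done = False

    def ready(self):
        return all(dep.done for dep in self.depends_on)

def next_task(tasks):
    """A worker picks the first unfinished task whose dependencies are met."""
    for t in tasks:
        if not t.done and t.ready():
            return t
    return None

# one staging task; every regular task depends on it
staging = Task("stage_system_files")
runs = [Task("traj_%d" % i, depends_on=[staging]) for i in range(3)]
```

Whichever worker grabs `stage_system_files` first runs it alone; the trajectory tasks only become ready afterwards, so no try-except around file creation is needed.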

Simulation / Model / Analysis Workflow

@nsplattner, @thempel and I were discussing the general recommended flow of data; this is somehow related to the questions in #23.

Question

In #23 I asked about the structure of reduced trajectories (multiple files...) so that PyEMMA or another analysis can always be used. Now, with the new directory approach, there is no fixed structure, and hence the framework cannot guess what to do with the trajectories: which files in the directory to use, etc. Still, the Trajectory objects will have information about strides, but I guess that an engine will subclass Trajectory to add certain information that is needed for restarts with the engine's particular way of storing things.

It means that an engine writing the files in a trajectory folder could also add information about filenames, etc. to the Trajectory object it returns. We could agree that an engine needs to provide functions or bash snippets to extract a frame from such a trajectory. A trajectory knows its generating engine, and so a trajectory would have access to code that can extract frames, etc.

So, in theory it would be possible to write the trajectory analysis independent of the engine that generated the data. That was my original approach, but I guess this will not reflect the way things are currently done by people. Everyone wants something specific for whatever reason, and so the trivial solution is that

1. Trajectory generation and trajectory analysis goes in pairs.

You need to pass exactly the files PyEMMA needs and tell PyEMMA about the stride, etc. you used to generate them. The downside is that the code becomes less reusable and hence easier to screw up.
This is easy because everyone writes their own code.

2. Write engine specific functions to read trajectories into pyemma

Hmmm, that would mean adding analysis-specific code to the engine, and I would really like to keep these separate. Still, it could make sense to have functions that allow you to get certain files for certain aspects:

t = Trajectory(...)
reduced_traj = engine.get_reduced(t)  # find the file for the reduced traj
full_traj = engine.get_full(t)  # find the file for the full traj

# this would be the normal way and you need to know `reduced1.dcd` as filename
reduced_trajs_for_analysis = project.trajectories.all.to_path('reduced1.dcd')

3. Use feature trajectories

This is what we discussed and could make sense. Instead of re-writing the PyEMMA input you need to write an engine specific featurizer which could be much simpler. It will also cache features for all trajectories. Useful, if these are expensive to compute but cheap to store.

It requires an intermediate featurization step, but then you just pass featurized trajectories to PyEMMA

In this approach we still need to figure out where to store the feature_trajs. This could be in the trajectory folder, since the trajectory needs to exist before you can compute features.
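The caching idea in option 3 could look roughly like this: compute the feature trajectory once per trajectory folder and reuse it afterwards. This is only a sketch; the `featurize` callable and the cache file name are assumptions, not adaptivemd API:

```python
import os
import pickle

def cached_features(traj_dir, featurize):
    """Compute features for one trajectory folder once; reuse them afterwards.

    `featurize` stands in for a hypothetical engine-specific featurizer.
    """
    cache = os.path.join(traj_dir, "features.pkl")  # assumed cache file name
    if os.path.exists(cache):
        with open(cache, "rb") as f:
            return pickle.load(f)
    feats = featurize(traj_dir)  # the expensive step
    with open(cache, "wb") as f:
        pickle.dump(feats, f)
    return feats
```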

IDEAS?

Additional Features

Just a list of features we think are useful...

  • support for FeatureTrajectories (I think I had this almost figured out and ready). A clean way to generate Trajectory Files that only contain features besides existing trajectories. Analysis should then be able to work on both types of trajectories, feature and atom alike.

THE BIG UNIFICATION

More on the grand (unifying) project, if I had more time. I guess that would be the ultimate simulation machine. Let me lay out what I think would be enough to allow automatic parallel execution of most simulation needs (RepEx, AdaptiveMD, TIS and derivatives, MetaDynamics). What I propose is not about the way the resulting simulations are used or about restart decisions, but merely about what features would be needed to cover all these cases.

Code for this mostly exists in adaptivemd or openpathsampling in some form. If possible I would like to guide OPS to use adaptivemd in a future version.

  1. Simulation Engines / Generators like in adaptivemd
  2. Trajectory Snippets: we do not (necessarily) combine trajectories, but keep the simulated parts as trajectory parts (which can have feature trajectories as well). In OPS we have this already, but use trajectories of single-frame size. This is extremely flexible, but has a large overhead that slows down analysis. A transition to Trajectory objects that consist of sub-trajectories would be desirable.
  3. Trajectory Files like in adaptivemd. We strictly separate the project metadata and trajectory files
  4. Alterable Trajectories like in OPS. You can chop, extend, and reverse trajectories to your liking.
  5. Reversible Features like in OPS. Features must support the concept of reversibility.
  6. Trajectory Features like in upcoming OPS. A trajectory feature is a feature that is a single value for the whole trajectory object. Example: the maximal phi angle. This will give an insane speed-up for analysis.
  7. Featurizer like in OPS. In OPS these are the connection between trajectories and the rest of the framework. You cannot access atom coordinates unless you create a feature for them. You then access the values via feature(trajectory), like applying a function. A feature knows how it is computed (using PyEmma, etc.) and is cached. In adaptivemd it would be an abstraction of the FeatureTrajectory concept, so feature(trajectory) will automatically use the constructed FeatureTrajectories and/or schedule computation of the features if necessary. Optionally we allow storage of certain features in the DB for later analysis (like in OPS).

... more to come ...

Local unit tests

Unit tests fail when running them on my local resource; apparently some paths are not well configured. I get the following error with test_simple_wrapped.py:

IOError: [Errno 2] No such file or directory: '/examples/files/alanine/alanine.pdb'

Tests run with py.test -v --pyargs adaptivemd --doctest-modules.

Tutorial part 1

These are comments / requests for tutorial part I:
https://github.com/markovmodel/adaptivemd/blob/master/examples/tutorial/1_example_setup_project.ipynb

Overall I really like the API. Here are a few minor issues and comments:

General: The hardest part in learning an API (also this one) is to learn the terminology. To make this easier, it's important to always stick to the same terminology (mostly the case but not always), and to explain the terminology first. I suggest creating an overview slide like the one you sent me for the presentation (also for the docs) that defines all terminology used in the framework, and then stick to this terminology in the code naming conventions, such as task, engine, storage area, etc. You can link to that doc page from the tutorial notebooks.

from adaptivemd import LocalCluster, AllegroCluster
For a public package, it seems strange to have a class specific to the Machine of one group. How about the following: We encode the configuration of the resource in a config-file, and load that, e.g.
resource = LocalCluster('BUILTIN/allegro.config')
(Change the naming convention if you find something better.) allegro.config is in this case part of the package, but you could specify another file name here. That would also be very useful in case we have to change the configuration because we have a new SLURM version or something else changes on the cluster. This is general, and we don't have to publicize the existence of the allegro.config file outside our group.

Cell 6: Does LocalCluster mean you are just running on the local machine where this notebook runs? What about LocalHost? For me local cluster is something like a group cluster.

resource.wrapper I get the idea, but it isn't intuitive to me to call this wrapper. Maybe you can explain the idea a bit more or find a different name. Maybe task_wrapper? Is the idea of this object to wrap the task?

Cell 16:

  • pdb_file = File('file://../files/alanine/alanine.pdb').named('initial_pdb').load()
    should this be ‘name’ or ‘named’ (text before says ‘name’)
  • unit - since the term unit doesn’t seem to be used elsewhere in the framework, how about calling this “worker”?
  • staging - the sentence after that has a few grammar errors.
  • Also I’m not familiar with the term staging. If I understand correctly this is the long-time storage of the project, i.e. where you put all of your trajectory data. What about storage?

Cell 17: This is probably too short notice to change it, but for packages like OpenMM that possess an API it is strange (and introduces unnecessary sources of error) to go through a bash execution: you seem to have an API -> script -> API transition.
In my mind, the Engine should define what and how to run, and a ScriptedEngine could be a special case where the Engine API doesn’t directly map to a call, but to a script generation+execution. Maybe this is how things work already, my comment is just based on what I see in the tutorial.

Cell 19: again name vs named?

Cell 20: not sure why you refer to stores. I guess this is related to storage, but the user probably doesn’t need to know about that.

Section header 3: small typo in 'initial'

Cell 21: The reference to the engine confused me at first, but actually you just take the pdb name from there. I’d change this to pdb_file as everywhere else

Cells 21/22/23: It’s not really clear why the trajectory object is first added to the project, then you create a run-task with this trajectory in the engine, then you queue the task. Seems somehow redundant. What happens if we do not execute cell 21, and why are cells 22/23 not sufficient to establish all required information?

Cancel a running job without shutting down worker

In some cases, it is necessary to cancel a running job. Right now, this can only be done by shutting down and restarting the worker, which especially in our current 4GPU-worker setup can be a bit annoying.
Most convenient would be a task.cancel()-function. However, it will cause some trouble with depending jobs, e.g. with submitted trajectory extensions. Would it make sense to optionally also delete all depending jobs? Or just a user warning?
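Whatever the interface (a task.cancel() method or otherwise), the tricky part is collecting the depending jobs first, so the user can be warned before anything is removed. A sketch of gathering the transitive dependents (the class and field names are hypothetical):

```python
class Task(object):
    """Hypothetical minimal task with explicit dependencies."""
    def __init__(self, name, depends_on=()):
        self.name = name
        self.depends_on = list(depends_on)

def dependents_of(task, all_tasks):
    """Return `task` plus every task that transitively depends on it."""
    doomed = {task}
    changed = True
    while changed:
        changed = False
        for t in all_tasks:
            if t not in doomed and any(d in doomed for d in t.depends_on):
                doomed.add(t)
                changed = True
    return doomed
```

A cancel operation could then list these names and ask for confirmation, or emit a warning, before deleting anything.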

PyEmma Analysis

For actual adaptive simulations, a flexible way to analyze the data is very important. Apart from choosing input parameters such as the lag time and MSM states, it will be necessary to modify e.g. input features. Modifying msmanlyze.py, which is effectively part of the adaptivemd source code, is not very convenient. It might furthermore be a bit risky to give the user the standard analysis for the alanine dipeptide, because they must opt out in order to avoid meaningless yet working results. For more complex types of trajectories, even more options need to be considered. I would suggest to either

  • have a script-based solution that allows the user to write custom scripts, maybe adding their path to a modeller object. This would also make it possible to keep track of which modeller has been used to generate which trajectories.

or

  • a function-based solution: I don't know if this even works, but it might be even better to define custom functions for the analysis which must take a given set of input parameters and produce output in a given shape. I am thinking of something similar to PyEmma's featurizer.add_custom_function(). It would make it possible to see directly which keyword arguments can be chosen. Further, the function could be stored in the database, I suppose, making it easy to keep track of the strategy used. It might also be easier to add this to the "brain"...
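The function-based route could mirror PyEmma's add_custom_function(): the user supplies a callable plus keyword arguments, and the pair is stored so the same analysis can be replayed later. A minimal sketch (the `(trajectories, **kwargs)` signature is an assumption, not the actual adaptivemd API):

```python
def make_modeller(analysis_fn, **kwargs):
    """Bundle a user-supplied analysis function with its keyword arguments,
    so the pair can be stored (e.g. in the DB) and applied later."""
    def modeller(trajectories):
        return analysis_fn(trajectories, **kwargs)
    modeller.analysis_fn = analysis_fn  # kept for bookkeeping / storage
    modeller.kwargs = kwargs
    return modeller

# hypothetical user-defined analysis: count frames after striding
def count_frames(trajectories, stride=1):
    return sum(len(t[::stride]) for t in trajectories)

model = make_modeller(count_frames, stride=2)
```

Because the function and its kwargs are kept as attributes, the record of which analysis produced which model stays inspectable.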

Tutorial 5 not working

Using one's own generator class, i.e. for implementing a pyemma analysis script, does not work as described in notebook 5. First, there are several bugs in the notebook itself. Substituting the given my_generator.py by e.g. (just a nonsense minimal example; the imports are the important part...)

from adaptivemd.generator import TaskGenerator
from adaptivemd.task import PythonTask

class MDTrajFeaturizer(TaskGenerator):
    def __init__(self):
        super(MDTrajFeaturizer, self).__init__()

    @staticmethod
    def then_func(project, task, data, inputs):
        # add the output for later reference
        project.data.add(data)

    def execute(self):

        t = PythonTask(self)
        t.call(
            my_script,
            param1='', 
            param2=''
        )

        return t
        
def my_script(param1, param2):
    return {"whatever you want to return"}

and, inside the notebook, registering it in the db

from my_generator import MDTrajFeaturizer
project.storage.update_storable_classes()

seems to fix some of the errors. If I'm not wrong, also the file itself needs to be added to the project, which I did with

File('file://my_generator.py').load()

Now, according to project.storage.simplifier.class_list, the class is registered. Still, when I start a worker with a task from this generator, I get

...
  File "/storage/mi/thempel/adaptive_MD_test/adaptivemd/adaptivemd/mongodb/dictify.py", line 250, in build
    obj['_cls'])
ValueError: Cannot create obj of class `MDTrajFeaturizer`.
Class is not registered as creatable! You might have to define
the class locally and call `update_storable_classes()` on your storage.

Any ideas? @marscher @jhprinz @euhruska

Complete API docs

Complete API docs of all user-accessible objects and functions. Currently only a small fraction of these have docs.

Events not working

When using the tutorial ipython notebooks, importing Event and FunctionalEvent is not possible. I remember there was some change in the API, but can't find a reference.

Simple test in README.md does not work

The simple test in the README doesn't work:

run a simple test

cd adaptive/tests/
python test_simple.py

  1. The directory name is wrong. It should be 'adaptivemd/tests/'
  2. The MongoDB database has to be running, which is not mentioned.
  3. It generates an assertion error:

(python2) macbook>python ./test_simple.py
changed state to booting
changed state to running
task succeeded
removing worker dir
Traceback (most recent call last):
File "./test_simple.py", line 89, in
assert(os.path.exists(traj_path))
AssertionError

Modeller Arguments are not passed

If I'm not somehow missing something, the arguments that can be given to the PyEMMAAnalysis class (.args) are simply stored, but not passed. Even though I kind of like it (no more error messages...), this should somehow work. Best would be to make it congruent with the arguments passed to OpenMMEngine.

Create a RP worker

I think that would be a good addition. Currently you submit tasks to the queue and the stand-alone workers grab jobs to complete.

The RP implementation with a scheduler requires a user to explicitly place jobs in the correct order in the RP queue, and they will be executed. I would like to copy the same single-worker mechanism to auto-submit jobs to an RP queue.

That means you can just add all RP clusters to the execution as well. Example

# submit some trajectory jobs
project.queue(project.new_trajectory(pdb_file, 100) for f in range(1000))

then you would start an RP job worker that will create a session for some time and grab some tasks to be executed.

adaptiverpworker -d {...} --cores=100 --resource=fub.allegro --walltime=02:00:00 project_name

this can run wherever it has access to the cluster and to the DB. Options should be similar to the adaptiveworker.

Tutorial part 3

This refers to
https://github.com/markovmodel/adaptivemd/blob/master/examples/tutorial/3_example_adaptive.ipynb
While I'm working through the notebook, some of the questions / comments refer to the framework in general. Apologies if I am making comments or asking questions that have been addressed elsewhere.

  • General: The main thing I'd like to understand in this notebook is how to write an adaptive sampling script in general. For that I would need to be able to write the strategy function in such a way that I can access all relevant information from the database (model or statistics computed from the stored trajectories) in order to make a decision. So let's say I want to implement simulated annealing through the framework, i.e. whenever running a trajectory, I want to restart from the last frame of the previous trajectory, and I want to select the temperature to run at using a Monte-Carlo step that employs the potential energy and the temperature of the last frame. In order to implement that I guess we would need to store these quantities in the database after the run (how do we do that in principle?), and then access them from the strategy function (how do we access the project/database from here?). In general, these two essential aspects are not clear from the tutorials. They do not necessarily need to be covered in the tutorials, but they should be extensively explained in the docs.

  • Cell 4: <StoredBundle for with 222 file(s) @ 0x11443ab50>. Why "for with" and what does the number in the end mean?

  • Cell 4: "Now restore our old ways to generate tasks" - what does that mean?

  • Cell 6: I would replace the call to map by a list comprehension as it's more widely known and map isn't very commonly used.

  • Cell 6: Phew, this needs more explanation. It's key to understand this function and how to use it. I think I have a very rough idea, but I'm still not very clear. So let's see:

    • At first, I had the impression that this function would need a specific set of parameters and I was trying to find an API documentation that describes how to write it. I couldn't find that - now I have the impression that you can write whatever you want here (and with any list of parameters), but I'm not sure. So let's first describe who is calling this function when, in order to understand where it is placed in the control flow. Again a pictorial description of the control flow in the docs with an indication where this function is called and what it does would help.
    • What I find most curious is that this function refers to project, engine and modeler, but these are defined outside the function. In a Jupyter notebook this will work, because variables are global, but how is this supposed to work in general? Of course it would be extremely important to be able to refer to the project inside this function such that you can compute whatever is needed in order to make the decision (unless you call a standard function). So how do we access these things in general? I guess passing a reference to the project to the function is probably a useful way to do this, because from there we can (hopefully) access all quantities needed.
    • What do the yield statements do? I have to confess I've never used yield. I was aware that yield is something like a return used in generators, e.g. see here: http://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do-in-python. However in your function yield appears to cause the process to wait until the subsequent expression is true (that's also what you write in the text). I couldn't find a documentation of yield which does that. Maybe this behavior is a side effect but it seems to me that this is not the conventional use of python yield. Please clarify. If the functionality can be implemented differently, I think that would be preferable, because I guess I'm not the only person who will be confused here.
    • Nomenclature. What exactly is an event in this framework? In my understanding an event is something that happens, so why can you add an event to a project? Maybe you mean that you add an event listener, or a function that reacts to events, instead?
    • Nomenclature. I have no clue what that FunctionalEvent could mean. Again, maybe you mean a function that responds to certain events - and if so what kind of events?
    • project.add_event(strategy(loops=2)). I don't understand what happens here. You add this strategy function to the project and set the loops variable to 2 (so you run two adaptive loops with 4 workers per loop, I guess). But why does add_event get a strategy? Why isn't it add_strategy? Maybe it's a terminology problem.
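To illustrate the yield question above: in a generator-based event loop, yield itself does not wait. The generator hands a condition back to the scheduler, and the scheduler resumes the generator only once that condition holds. A minimal self-contained sketch of this mechanism (toy names and state, not the actual AdaptiveMD API):

```python
def strategy(state, loops=2):
    # Each `yield` hands the scheduler a condition (a callable); the
    # generator is resumed only once the condition evaluates to True.
    for loop in range(loops):
        state['queued'] += 4                            # queue 4 trajectory tasks
        yield lambda: state['done'] >= state['queued']  # wait for them to finish
        state['models'] += 1                            # then run one analysis

def run_event(gen, state):
    # Toy scheduler standing in for what add_event's event loop might do:
    # poll each yielded condition while "workers" make progress.
    for condition in gen:
        while not condition():
            state['done'] += 1          # pretend a worker finished a task

state = {'queued': 0, 'done': 0, 'models': 0}
run_event(strategy(state, loops=2), state)
print(state)  # → {'queued': 8, 'done': 8, 'models': 2}
```

This also suggests an answer to the globals question: passing a reference (here, `state`; in AdaptiveMD presumably the project) into the strategy function avoids relying on notebook-global variables.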
  • Cell 7: The note says something about RP (radical-pilot), but I guess the package doesn't use RP anymore or at least not by default, so this is confusing.

  • Cell 9 : what does while project._events: do? What does it mean if an event is True or False?

  • Cell 10: Now it reads like adding an event is basically submitting a (macro)job or adaptive sampling job that consists of a pack of microjobs coordinated on the computing resource. If that is what you mean by event, why not call it a macrojob or something similar that is clearly understandable?

  • The rest of the notebook is really unclear, but it also seems unfinished, so I'm going to stop here for now.

Tutorial part 2

This refers to the tutorial part 2:
https://github.com/markovmodel/adaptivemd/blob/master/examples/tutorial/2_example_run.ipynb

  • General:

    • I think it would be important to make clear that this notebook (if I understand correctly) defines an execution plan, i.e. it specifies how trajectories will be run or restarted before anything has actually been run. Either way this isn't really clear from the text, and we need to make clear in both the text and the API what refers to plans or queues and what actually reports the status of available data; these things need to be clearly separated.
    • Many typos and grammar errors, many sentences are not clear (that's it, do whatever you need...). Generally, always avoid the words 'this' or 'that' whenever it's ambiguous what they mean (e.g. when they refer to a noun in one of the previous sentences, but there are multiple nouns they could refer to).
  • Cell 3: OK, so creating a Project with a name either creates a new Project if the name doesn't exist yet, or it loads the existing project under that name. That's not clear from the API docs and could also be made more explicit in the tutorial here.
    Other issues / comments with that:

    • it looks like anyone can delete any project by knowing its name. If people are using the same MongoDB server this is a bit dangerous. I'm not calling for a secure solution with passwords etc. (takes too long to implement), but please think of a minimal solution that helps to avoid unintentional deletion of other people's projects. Maybe require the user name of a project to also be specified in the delete command.
    • is there a way to list the projects in the database? Also, how do you specify the database? (The tutorial just creates a Project, so I guess there is a standard location or name for the database on the local host, but if it's not there, how do I specify which host/database to use?)
  • Cell 6: Nice. However, Files (probably also other objects) seem to have almost no API docs. Please complete API docs for all objects that are visible to the user.

  • "make the task" should be "create the task" (everywhere)

  • Cell 7: Trajectory object, parameter frame. Is this always the starting frame for the trajectory? Then give it a more explicit name.

  • Cell 10: That is a bit unclear and tricky indeed.

    • I guess by "source trajectory" you mean the trajectory to be extended. Now you write the trajectory can only be run if the source trajectory exists - since the object apparently exists already at that point, I guess you mean that the extend command is only executed when the trajectory object is filled with the results of the first run - if correct, please make that clear in the tutorial.
    • In general, I don't see how the correct order is guaranteed. For example, if I create 10 tasks, consisting of 1 task that runs the initial trajectory, and 9 extension tasks, then these tasks should run in sequence, i.e. it doesn't make sense if all or several of the 9 extension tasks start running after the first piece is available. Is this implemented? In general it's not clear at this point how the user can ensure an execution sequence.
    • I feel that a "complete solution" of managing execution sequences would require implementing dependency graphs, which is beyond what's possible now. I'm perfectly fine with having a very minimal solution that just provides very basic tools of flow control (e.g. be able to give ordered task lists to make sure that they are always run in sequence). Just be explicit in the documentation about that and say what hasn't been done so far.
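One minimal way to express the ordering constraint discussed above is an explicit dependency chain, where a task only becomes ready once its predecessor has finished. A toy sketch of this idea (illustrative names only, not AdaptiveMD's implementation):

```python
class Task:
    """A toy task that declares an explicit dependency on another task."""
    def __init__(self, name, depends_on=None):
        self.name = name
        self.depends_on = depends_on
        self.done = False

    def ready(self):
        # runnable only once the task it depends on has finished
        return self.depends_on is None or self.depends_on.done

# one initial run plus nine chained extensions
chain = [Task('run-100-frames')]
for i in range(9):
    chain.append(Task('extend-%d' % i, depends_on=chain[-1]))

# workers grab any ready task; the chain guarantees only one is ever ready
order = []
pending = list(chain)
while pending:
    task = next(t for t in pending if t.ready())
    task.done = True
    order.append(task.name)
    pending.remove(task)

print(order[:3])  # → ['run-100-frames', 'extend-0', 'extend-1']
```

Even without a full dependency graph, a single `depends_on` link per task would be enough to guarantee that extension tasks run in sequence.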
  • Cell 12: It's unclear whether this shows planned trajectories and lengths, or actually existing data. The text says we should expect a trajectory of length 100 and another of length 150, but that is not shown in the output. I guess we only see the trajectory of length 100 that was added to the project in cell 8 of this notebook; if so, that's also confusing, because its planned length is really 150 and at this point nothing has been run yet. So how do we get information about all planned trajectories and all available data?

  • Cell 14: You say the task failed, but the output shows u'queued'. Why is it clear that it failed?

  • Cell 16: I don't understand the NOTE. Please reformulate / clarify.

  • Cell 17: I don't understand what "If you have the trajectory object and want to just create the trajectory, you can also put this in the queue." means. Also "do whatever you need" doesn't say anything.

  • Cell 17: At first it was unclear to me what this cell is supposed to show. Say that this is a shortcut, where the project generates the task for you.

  • Cell 18: trajectory = project.trajectories.one: what does that mean? Is the attribute actually called one, and what is it? Do you want to refer to the first trajectory in the list?

  • Cell 18 shows no output, but the text below says "Good, at least 100 frames". What are we supposed to see here? And again, would this show frames that have already been run, or just the planned length?

  • Cell 20: project.wait_until(condition) is extremely important, but just a side note here. Maybe highlight how to do flow control in a separate notebook or section.
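For reference, a blocking wait_until can be as simple as polling a condition callable until it becomes true. This is an assumption about the mechanism for illustration purposes, not the library's actual code:

```python
import time

def wait_until(condition, poll_interval=0.01, timeout=5.0):
    """Block until `condition()` is True, polling at a fixed interval."""
    deadline = time.time() + timeout
    while not condition():
        if time.time() > deadline:
            raise TimeoutError('condition not met within %.1f s' % timeout)
        time.sleep(poll_interval)

# a toy condition that becomes true after being checked three times
counter = {'n': 0}
def three_checks():
    counter['n'] += 1
    return counter['n'] >= 3

wait_until(three_checks)
print(counter['n'])  # → 3
```

A dedicated flow-control notebook could show how such conditions combine with queued tasks and events.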

  • Cell 22: should this be model / modeller or analysis? Make sure terminology is consistent.

  • Cell 27: What does project.find_ml_next_frame(4) mean, especially what's ml? Also, you say in the text that the simplest strategy is to use 1/c, but I don't see anything in the commands specifying that.

  • Cell 27: Now the term "Brain" appears, but that is not used anywhere else so far, so change it to the actual nomenclature. Also, what does "This will be moved to the Brain" mean? What is "this"?

  • Cell 27/28: Relationship is not clear. Is executing Cell 27 required to run Cell 28? If so, why? Or are they independent, and if they're independent, what's the use of Cell 27 and how can we use the results from Cell 27 in Cell 28?

  • What about workers: Control and behavior of workers (also automatic shutdown when they don't receive tasks for a while) should be documented in a general doc page (link to it here)

Better worker control

Add shortcuts to shutdown all workers, etc. Or find a nice way to do something with all workers at the same time.

```python
project.workers.shutdown()

# equivalent to
for w in project.workers:
    w.shutdown()
```

Some worker bug

Any ideas what this is about? I just ran into it with the new master branch code. Apologies that the error messages are quadrupled due to the four-worker setup.
job.914608.txt

Python 3 compatibility

PyEMMA's support for Python 2 will soon expire (also, Python 2's end of life is approaching). That means we need Python 3 support.

Worker job queuing might be inefficient

It might be inefficient to allow a worker to take more than one job at once, at least if a single job takes longer than a certain time. For example, I just submitted four trajectories to four workers. One worker grabbed two jobs, leaving another worker idle. Would it be more efficient to restrict each worker to a single job at a time? @nsplattner
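The effect described above can be reproduced with a toy assignment model: greedy grabbing lets early workers claim multiple jobs while others sit idle, whereas a per-worker limit of one spreads the load evenly. This is purely illustrative, not AdaptiveMD's scheduler:

```python
def assign(jobs, workers, max_per_worker):
    """Assign each job to the first worker that is under its job limit."""
    load = {w: 0 for w in workers}
    for job in jobs:
        for w in workers:
            if load[w] < max_per_worker:
                load[w] += 1
                break
    return load

jobs = ['t1', 't2', 't3', 't4']
workers = ['w1', 'w2', 'w3', 'w4']

# greedy grabbing: two workers take everything, two sit idle
print(assign(jobs, workers, max_per_worker=2))  # → {'w1': 2, 'w2': 2, 'w3': 0, 'w4': 0}
# one job per worker at a time: the load is spread evenly
print(assign(jobs, workers, max_per_worker=1))  # → {'w1': 1, 'w2': 1, 'w3': 1, 'w4': 1}
```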
