choderalab / yank

An open, extensible Python framework for GPU-accelerated alchemical free energy calculations.

License: MIT License

Python 99.74% Shell 0.21% Dockerfile 0.05%

molecular-dynamics molecular-dynamics-simulation alchemical-free-energy-calculations drug-discovery free-energy openmm mskcc python alchemical free-energy-perturbation

yank's Issues

Fix travis-ci for yank

Currently, it seems like it stalls on cloning mdtraj via github:

https://travis-ci.org/choderalab/yank/builds/20381725#L3206-L3210

Can I just use the pypi package instead, or a conda package?

Deprecate `from sets import Set`

Since Python 2.4, `set` has been a built-in type.

I'm pretty sure we can deprecate support for python 2.4--it's been ~2 years since I last used 2.6...
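For reference, a minimal sketch of what the replacement looks like, assuming the code only relies on the usual set operations. The atom indices are made up.

```python
# The built-in set replaces the old `from sets import Set` import.
ligand_atoms = set([0, 1, 2])
receptor_atoms = set([2, 3, 4])

# Union, intersection, and difference are spelled the same way as before.
shared_atoms = ligand_atoms & receptor_atoms
all_atoms = ligand_atoms | receptor_atoms
ligand_only = ligand_atoms - receptor_atoms

# frozenset is the built-in counterpart of sets.ImmutableSet.
frozen_ligand = frozenset(ligand_atoms)
```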

Eliminate duplicated functions

We should try to move as much as possible into "standard" modules.

For example,

kyleb@kb-intel:~/src/kyleabeauchamp/yank$ cat yank/*.py|grep analyze_accept
def analyze_acceptance_probabilities(ncfile, cutoff = 0.4):
def analyze_acceptance_probabilities(ncfile, cutoff = 0.4):

I've seen other duplicated functions as well. IMHO, one way to deal with this is to create utils.py and dump any "utility" functions there--until we find a better place to put them.
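To find the rest of the duplicates systematically, something like the following could work. This is only a sketch: `find_duplicate_functions` is a hypothetical helper, and the file contents are placeholders standing in for yank/*.py.

```python
import ast
from collections import defaultdict


def find_duplicate_functions(sources):
    """Map each function name defined more than once to its locations.

    `sources` is a dict of {filename: source text}; reading the files from
    disk is left to the caller.
    """
    locations = defaultdict(list)
    for filename, text in sources.items():
        for node in ast.walk(ast.parse(text, filename=filename)):
            if isinstance(node, ast.FunctionDef):
                locations[node.name].append((filename, node.lineno))
    return {name: sorted(places) for name, places in locations.items() if len(places) > 1}


# Placeholder file contents for illustration only.
example_sources = {
    "analyze.py": "def analyze_acceptance_probabilities(ncfile, cutoff=0.4):\n    pass\n",
    "yank.py": "def analyze_acceptance_probabilities(ncfile, cutoff=0.4):\n    pass\n\ndef run():\n    pass\n",
}
duplicates = find_duplicate_functions(example_sources)
```

Note that this also flags methods that share a name across classes (e.g. `__init__`), so the output would need a quick manual pass.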

How should we feed input to YANK?

We have to make some decisions about how we tell YANK what we want it to do.

To be specific, we need to tell it:

What input to use for the ligand, which might be a mol2 file, SDF file, IUPAC or common name, AMBER prmtop/inpcrd pair, etc.
What input to use for the receptor, which might eventually be a PDB file, a PDB ID, an AMBER prmtop/inpcrd pair, or even another small molecule (as in the host-guest case).
If anything isn't parameterized, we need to tell YANK how to assign parameters.
There are some other things we may need to tell YANK about how to set up systems in explicit solvent or build in missing atoms/residues.
There are some run parameters too, like how many iterations to use, what kind of restraints, etc. Most of this should eventually be fully automated, but there are a few parameters right now.

We have a few options for how to specify this:

Python scripts that use the Yank module. All parameters are coded in Python.
Command-line scheme, perhaps using Robert's commandline tool, so we can say something like
- yank setup to set up a calculation
- yank run to run/resume a calculation
- yank info to get some quick info on progress
- yank analyze to analyze a calculation
Some sort of input parameter file format, like XML or JSON

Thoughts?
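As a point of comparison for the command-line option, here is a rough sketch using only the standard library's argparse rather than Robert's tool. The flag names (`--ligand`, `--receptor`, `--iterations`) are hypothetical; only the four subcommand names come from the list above.

```python
import argparse


def build_parser():
    """Build a parser with one subcommand per stage of a calculation."""
    parser = argparse.ArgumentParser(prog="yank")
    subparsers = parser.add_subparsers(dest="command")

    # Hypothetical options; the real set of flags is still to be decided.
    setup = subparsers.add_parser("setup", help="set up a calculation")
    setup.add_argument("--ligand", required=True, help="ligand input file")
    setup.add_argument("--receptor", required=True, help="receptor input file")

    run = subparsers.add_parser("run", help="run or resume a calculation")
    run.add_argument("--iterations", type=int, default=1000)

    subparsers.add_parser("info", help="report progress")
    subparsers.add_parser("analyze", help="analyze a calculation")
    return parser


# Equivalent to typing: yank run --iterations 10
args = build_parser().parse_args(["run", "--iterations", "10"])
```

A parameter file (JSON or similar) could still be layered on top by having `setup` accept a path to it.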

Implement SystemBuilder tests

Would be useful to add a yank/tests/test_systembuilder.py set of nosetests that just make sure SystemBuilder constructs valid systems from various combinations of inputs.

Update analyze.py to be able to extract just hydration free energies

Use a library to print tables?

In several places in Yank, we have code that formats various tables (e.g. TProb).

If we don't mind a Pandas dependency, we could just do this:

import pandas as pd
T = pd.DataFrame(T)
print(T.to_string(float_format=formatter_lambda_function))

I already took the liberty of trying this in my Repex refactor.

If we don't do this, at the very least we should write one function that formats tables and try to re-use it as much as possible.
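If we go the no-dependency route, the shared function could be as small as the sketch below. `format_table` is a hypothetical name, and the 2x2 matrix is made up.

```python
def format_table(matrix, row_labels=None, float_format="%6.3f"):
    """Return a matrix of floats as an aligned multi-line string."""
    lines = []
    for index, row in enumerate(matrix):
        label = row_labels[index] if row_labels is not None else str(index)
        cells = " ".join(float_format % value for value in row)
        lines.append("%4s %s" % (label, cells))
    return "\n".join(lines)


# Made-up transition probabilities for illustration.
table = format_table([[0.9, 0.1], [0.25, 0.75]])
```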

SystemBuilder systems explode in alchemically-modified states

So, when the SystemBuilder-made complex system is used to construct a yank object, the alchemical intermediate systems explode. Here is the code that I'm currently using to construct the exploding systems:

    import os
    import simtk.unit as unit
    import simtk.openmm as openmm
    import numpy as np
    import alchemy
    import simtk.openmm.app as app
    #os.environ['AMBERHOME']='/Users/grinawap/anaconda/pkgs/ambermini-14-py27_0'
    os.chdir('../examples/p-xylene')
    ligand = Mol2SystemBuilder('ligand.tripos.mol2', 'ligand')
    receptor = BiomoleculePDBSystemBuilder('receptor.pdb','protein')
    complex_system = ComplexSystemBuilder(ligand, receptor, "complex")
    complex_positions = complex_system.positions
    receptor_positions = receptor.positions
    print type(complex_system.coordinates_as_quantity)
    timestep = 1.0 * unit.femtoseconds # timestep
    temperature = 300.0 * unit.kelvin # simulation temperature
    collision_rate = 20.0 / unit.picoseconds # Langevin collision rate
    minimization_tolerance = 10.0 * unit.kilojoules_per_mole / unit.nanometer
    minimization_steps = 20
    plat = "CUDA"
    i=2
    platform = openmm.Platform.getPlatformByName(plat)
    forcefield = app.ForceField
    systembuilders = [ligand, receptor, complex_system]
    receptor_atoms = range(0,receptor.traj.top.n_atoms)
    ligand_atoms = range(receptor.traj.top.n_atoms,complex_system.traj.top.n_atoms)
    factory = alchemy.AbsoluteAlchemicalFactory(systembuilders[i].system, ligand_atoms=ligand_atoms)
    protocol = factory.defaultComplexProtocolImplicit()
    systems = factory.createPerturbedSystems(protocol)

    #test an alchemical intermediate and

    for p in range(1,len(systems)):
        print "now simulating " + str(p)
        if p==5:
            continue #for some reason 5 is poorly behaved
        integrator_partialinteracting = openmm.LangevinIntegrator(temperature, collision_rate, timestep)
        context = openmm.Context(systems[p], integrator_partialinteracting, platform)
        context.setPositions(systembuilders[i].openmm_positions)
        openmm.LocalEnergyMinimizer.minimize(context, minimization_tolerance, minimization_steps)
        outfile = open('out_test'+str(p)+'.pdb','w')
        app.PDBFile.writeHeader(systembuilders[i].traj.top.to_openmm(), outfile)
        for k in range(10):
            integrator_partialinteracting.step(100)
            state = context.getState(getEnergy=True, getPositions=True)
            app.PDBFile.writeModel(systembuilders[i].traj.top.to_openmm(), state.getPositions(), outfile,0)
        app.PDBFile.writeModel(systembuilders[i].traj.top.to_openmm(), state.getPositions(), outfile,0)
        app.PDBFile.writeFooter(systembuilders[i].traj.top.to_openmm(), outfile)
        outfile.close()

Implement automated hydration free energy calculations / test with FreeSolv dataset

Might be nice if we could do fast automated hydration free energies via OpenMM / Yank

Run pyflakes and pep8 (syntax checking and style checking)

I'm finding a lot of hidden syntax errors via pyflakes...

ModifiedHamiltonianExchange

So should ModifiedHamiltonianExchange be replaced by a "regular" HamiltonianExchange object with a particular choice of MCMC moveset? I'm trying to wrap my head around where each component belongs.

Migrate complicated doctests to nosetests

So we have a lot of doctests that are pretty complex, with ~20 lines of code.

It might be nice for us to reserve doctests for tests that are primarily illustrative and put the more complex tests in a separate set of nosetests.

Slow tests in alchemy.py

Is this speed considered normal?

[reference_system, coordinates] = testsystems.LysozymeImplicit()

[...]

In [55]: %time reference_state = reference_context.getState(getEnergy=True)
CPU times: user 139.48 s, sys: 0.00 s, total: 139.48 s
Wall time: 139.62 s

[...]

In [69]: %time alchemical_state = alchemical_context.getState(getEnergy=True)
CPU times: user 142.45 s, sys: 0.02 s, total: 142.46 s
Wall time: 142.54 s

Do people have unmerged commits?

If so, we should file some [WIP] pull requests, so that we can each see what everyone is working on--this will help us avoid dealing with conflict resolution.

Avoid use of "from X import *" idiom

We should prefer either

import X
X.method()

or

from X import Y
Y()

IndexError complex_coordinates

When running yank.py example "p-xylene," an IndexError is thrown when yank.py attempts to access self.complex_coordinates[0].

It seems to be related to the following comment at the top of the code:

Handle complex_coordinates argument in Yank more intelligently if different kinds of input are provided.
Currently crashes if a Quantity is provided rather than a list of coordinate sets.

I'll play around with it.

Use logging for logging

I think we should be able to use the `logging` module to control the verbosity level without having to pass around verbose arguments everywhere.

It might be possible to make this work with MPI as well: https://github.com/jrs65/python-mpi-logger
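Roughly what I have in mind, as a sketch: `configure_logging` and `run_iteration` are hypothetical names, and the MPI side is not covered here.

```python
import logging

logger = logging.getLogger("yank")


def configure_logging(verbose=False):
    """Set the package-wide verbosity once instead of threading a flag through calls."""
    logging.basicConfig(format="%(asctime)s %(name)s %(levelname)s: %(message)s")
    logger.setLevel(logging.DEBUG if verbose else logging.WARNING)


def run_iteration(iteration):
    # No `verbose` argument needed; the logger decides what gets emitted.
    logger.debug("Starting iteration %d", iteration)


configure_logging(verbose=True)
run_iteration(0)
```

Each module would call `logging.getLogger` once at import time, and only the entry point would decide the level.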

YANK license?

What license do we want to use for YANK?

Currently, everything is GPL, but we may want to use a more permissive license, like LGPL.

It may be easiest to use the same license as OpenMM, unless there are issues with the libraries we use.

Check out bokeh for autogenerated reports

http://bokeh.pydata.org/

read_openeye_crd

What does this read and can it be deprecated?

Standardize SystemBuilder interface for getting positions

It looks like SystemBuilder does not have a standard interface for retrieving OpenMM-style Quantity-wrapped positions.

We should just have the thing.positions @property return standard Quantity-wrapped positions.

Accelerate _show_mixing_statistics()

The oldrepex._show_mixing_statistics() function (also in new repex) is useful, but slows down as the number of iterations increases. We should accelerate this with something like weave, cython, or the like.
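Another option that needs no compiled code is to keep running transition counts and update them once per iteration, instead of rescanning the whole history every time. This is a different technique from weave or Cython, and it assumes the slowdown comes from recomputing over all stored iterations. `MixingStatistics` is a hypothetical class, not the existing implementation.

```python
class MixingStatistics(object):
    """Accumulate state-to-state transition counts one iteration at a time."""

    def __init__(self, nstates):
        self.nstates = nstates
        self.counts = [[0] * nstates for _ in range(nstates)]

    def update(self, previous_states, current_states):
        """Record one iteration; each argument lists the state index of every replica."""
        for i, j in zip(previous_states, current_states):
            self.counts[i][j] += 1

    def transition_matrix(self):
        """Return row-normalized transition probabilities estimated so far."""
        matrix = []
        for row in self.counts:
            total = float(sum(row))
            matrix.append([count / total if total else 0.0 for count in row])
        return matrix


stats = MixingStatistics(nstates=2)
stats.update([0, 1], [1, 0])  # both replicas swap states
stats.update([1, 0], [1, 0])  # neither replica moves
```

The cost per iteration is then proportional to the number of replicas, independent of how many iterations have been run.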

Mol2SystemBuilder does not pass kwargs to antechamber

build_forcefield needs to pass keyword arguments to antechamber.

For instance, I can't parametrize a molecule with a net charge until antechamber receives the charge as input.
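A sketch of the forwarding, assuming antechamber is driven through a command-line call. Both functions here are hypothetical stand-ins (the real build_forcefield signature may differ); `-nc` is antechamber's net-charge flag.

```python
def build_antechamber_command(input_mol2, output_mol2, charge_method="bcc", net_charge=None):
    """Assemble an antechamber argument list, adding -nc only when a charge is given."""
    command = [
        "antechamber",
        "-i", input_mol2, "-fi", "mol2",
        "-o", output_mol2, "-fo", "mol2",
        "-c", charge_method,
    ]
    if net_charge is not None:
        command += ["-nc", str(net_charge)]
    return command


def build_forcefield(input_mol2, output_mol2, **antechamber_kwargs):
    """Hypothetical stand-in for build_forcefield that forwards keyword arguments."""
    return build_antechamber_command(input_mol2, output_mol2, **antechamber_kwargs)


# Only builds the argument list; nothing is executed here.
command = build_forcefield("ligand.mol2", "ligand.gaff.mol2", net_charge=1)
```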

import numpy as np

IMHO, this is the "standard" way of using numpy imports.

Also, it might be good to use numpy throughout whenever possible, rather than switching between math and numpy.

Replace weave code in oldrepex.py with something more modern

Is there a modern replacement for weave that is fast but less clunky?

Ensure that doctests run in interactive mode.

Sometimes our doctests contain code that only works during a doctest, e.g.:

        >>> # Create a reference system.
        >>> import testsystems
        >>> [reference_system, coordinates] = testsystems.AlanineDipeptideImplicit()
        >>> # Create a factory.
        >>> factory = AbsoluteAlchemicalFactory(reference_system, ligand_atoms=[0, 1, 2])
        >>> factory._is_restraint([0,1,2])
        False
        >>> factory._is_restraint([1,2,3])
        True
        >>> factory._is_restraint([3,4])
        False
        >>> factory._is_restraint([2,3,4,5])
        True

Because we're calling AbsoluteAlchemicalFactory in a local namespace, this code will not run without first running from alchemy import *.

We should therefore explicitly import alchemy and use the module name in the docstring.

The key advantage of this is that it allows users to copy and paste docstrings into their Python session for learning purposes.

Standardize docstrings (Numpy / sphinx / readthedocs) and tests (nose / travis)

I personally think it's worth the day of work that it will take.

IMHO the easiest thing is to use MDTraj as a guide (e.g. copy any necessary template files).

Use numpy docstring convention throughout

We should stick to the numpy docstring convention:
https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt

Split out testsystems into new project

@kyleabeauchamp : You want the following files:

yank/testsystems.py
yank/data/ - everything in this directory

Use np.linspace / etc. for spacings

IMHO we should avoid typing out 50 alchemical intermediates, e.g.:

+        alchemical_states.append(AlchemicalState(0.00, 0.00, 0.95, 1.)) #
+        alchemical_states.append(AlchemicalState(0.00, 0.00, 0.925, 1.)) #
+        alchemical_states.append(AlchemicalState(0.00, 0.00, 0.90, 1.)) #
+        alchemical_states.append(AlchemicalState(0.00, 0.00, 0.85, 1.)) #
+        alchemical_states.append(AlchemicalState(0.00, 0.00, 0.80, 1.)) #
+        alchemical_states.append(AlchemicalState(0.00, 0.00, 0.75, 1.)) #
+        alchemical_states.append(AlchemicalState(0.00, 0.00, 0.70, 1.)) #
+        alchemical_states.append(AlchemicalState(0.00, 0.00, 0.675, 1.)) #
+        alchemical_states.append(AlchemicalState(0.00, 0.00, 0.65, 1.)) #
+        alchemical_states.append(AlchemicalState(0.00, 0.00, 0.60, 1.)) #
+        alchemical_states.append(AlchemicalState(0.00, 0.00, 0.55, 1.)) #
+        alchemical_states.append(AlchemicalState(0.00, 0.00, 0.50, 1.)) #
+        alchemical_states.append(AlchemicalState(0.00, 0.00, 0.40, 1.)) #
+        alchemical_states.append(AlchemicalState(0.00, 0.00, 0.30, 1.)) #
+        alchemical_states.append(AlchemicalState(0.00, 0.00, 0.20, 1.)) #
+        alchemical_states.append(AlchemicalState(0.00, 0.00, 0.10, 1.)) #
+        alchemical_states.append(AlchemicalState(0.00, 0.00, 0.05, 1.)) #
+        alchemical_states.append(AlchemicalState(0.00, 0.00, 0.025, 1.)) #
+        alchemical_states.append(AlchemicalState(0.00, 0.00, 0.00, 1.)) # discharged, LJ annihilated
+        alchemical_states.append(AlchemicalState(0.00, 0.00, 0.00, 1.)) # discharged, LJ annihilated
+        alchemical_states.append(AlchemicalState(0.00, 0.00, 0.00, 1.)) # discharged, LJ annihilated
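A sketch of generating the schedule programmatically instead. To keep it dependency-free it uses a small `linspace` helper that mirrors `np.linspace(start, stop, num)`; with numpy available, the helper can simply be replaced. The `AlchemicalState` namedtuple and its field names are hypothetical stand-ins, and the spacing here is uniform, unlike the hand-written schedule above.

```python
from collections import namedtuple

# Hypothetical stand-in for AlchemicalState; the field names are assumptions.
AlchemicalState = namedtuple(
    "AlchemicalState", ["lambda_a", "lambda_b", "lambda_c", "lambda_d"]
)


def linspace(start, stop, num):
    """Stdlib helper mirroring np.linspace(start, stop, num) for num >= 2."""
    step = (stop - start) / float(num - 1)
    return [start + index * step for index in range(num)]


# Scale the third parameter from 1.0 down to 0.0 in 21 evenly spaced steps.
alchemical_states = [
    AlchemicalState(0.00, 0.00, round(value, 3), 1.0)
    for value in linspace(1.0, 0.0, 21)
]
```

A non-uniform schedule could be built the same way by concatenating several `linspace` segments with different densities.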

Add ability to estimate extrapolated free energy differences with perturbed systems

Switch to external repositories for Repex and TestSystems (when they are ready)

Also, this means that we shouldn't invest too much time cleaning up those files in the current Yank repo.

PS: I think TestSystems is pretty much ready to integrate, so that should happen first.

Add support for atom-by-atom alchemical intermediate definition

We can easily add an alternative alchemical intermediate generator based on this scheme:
http://dx.doi.org/10.1002/jcc.21829

Allow analysis function to use multiple NetCDF files (with their systems) for MBAR reweighting

Reading PDB files

Is there a reason that we manually parse PDB files instead of letting app.PDBFile.getPositions(asNumpy=True) do all the work for us?

Vacuum calculation: How should we handle this?

I've just removed the old pyopenmm pure-Python implementation of System and Force classes that I previously used to determine periodicity or remove Force objects from a System object. Now we have no way to create vacuum versions of molecules in case we want to compute hydration free energies in parallel with binding free energies.

It may be OK to simply leave this out and focus on binding free energies, since hydration free energies are extremely specialized anyway.

Yank.py has ValueError when yank.analyze() is called

File "yank.py", line 917, in analyze
[nequil, g_t, Neff_max] = timeseries.detectEquilibration(u_n)
ValueError: too many values to unpack

Find ways to reduce file sizes

Michael Shirts mentioned that his datasets for T4 lysozyme using old YANK were 44GB total for about 20 ligands. For his work on larger sets of proteins, he is generating ~4TB of data.

We should explore ideas for cutting down file sizes. Some obvious ones:

Enabling NetCDF compression by default
Saving checkpoint data less frequently, but energy data more frequently

Create Yank class for solvation free energies

Lee-Ping would like to be able to compute arbitrary solvation free energies for a molecule in another type of molecule.

Add getter and setter decorators for `yank.something` simulation parameters

Right now, many parameters in yank are set via the following scheme:

yank = Yank()
yank.n_iterations = 10
yank.timestep = 1.0
yank.other_thing = other_thing

To prevent insane combinations of these objects, it might be nice for us to use getter and setter decorators for all possible properties.
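For example, a validated `n_iterations` could look like the sketch below. This `Yank` class is a stripped-down hypothetical stand-in, and the validation rule is an assumption about what counts as a sane value.

```python
class Yank(object):
    """Hypothetical, stripped-down stand-in showing a validated parameter."""

    def __init__(self):
        self._n_iterations = 1

    @property
    def n_iterations(self):
        return self._n_iterations

    @n_iterations.setter
    def n_iterations(self, value):
        # Reject values that could never describe a valid run.
        if int(value) != value or value < 1:
            raise ValueError("n_iterations must be a positive integer, got %r" % (value,))
        self._n_iterations = int(value)


yank = Yank()
yank.n_iterations = 10  # same assignment syntax as before, now validated
```

Existing scripts keep working unchanged, since callers still assign to `yank.n_iterations` directly.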

Use @rmcgibbo's commandline class to rework command-line app

Robert's MixTape has a pretty clean object oriented tool for building up command line apps. It's worth considering here.

https://github.com/rmcgibbo/mixtape/

Change license to LGPL

I'd like to change the license of yank to be LGPL as well. Any objections?

Eliminate mdtraj-specific public API for SystemBuilder

All the mdtraj-based stuff should be private (internal) only. We may support mdtraj-based stuff in the future when we convert to repex, but not yet.

Use MDTraj for trajectory alignment / quaternions

So Robert and I have done some work integrating RMSD and alignment features into MDTraj.geometry.

We could possibly pull in some of the rotation stuff that's currently in Yank. One advantage is that it could be easier to maintain there...

Eliminate Trajectory Analysis code

So analyze.py has lots of trajectory analysis code that duplicates things I've already written in MDTraj. We can replace a lot of this redundant code with only a few lines of MDTraj code.

There is one remaining to-do item on the Repex side: a member function that slices the netCDF database and outputs MDTraj trajectories. (See, e.g. https://github.com/choderalab/repex/issues/49).

I think cleaning up this issue will lead to massive readability improvements and could make the Yank-Repex-MDTraj pipeline the go-to tool for analysis on repex datasets.

T4 lysozyme L99A + p-xylene (to compare to prmtop/inpcrd route)
- T4 lysozyme: PDB file, to be parameterized with app.ForceField with specified forcefield file(s) and implicit/explicit solvent choice
- p-xylene: mol2 file, to be parameterized with gaff2xml
CB[7] host-guest system
- CB[7] host: mol2 file
- hosts: list of IUPAC names
ligand design
- RCSB ID with chain identifiers, to be parameterized with app.ForceField with specified forcefield file(s) and implicit/explicit solvent choice
- Chemdraw file containing one or more molecules, to be parameterized with gaff2xml

Add unit tests to SystemBuilder

All modules should have unit tests for classes and methods.
