mobleylab / freesolv



Experimental and calculated small molecule hydration free energies

Home Page: http://www.escholarship.org/uc/item/6sd403pz

Python 81.41% Jupyter Notebook 18.59%

free-energies hydration solvation experimental-data calculated-values python experimental-values database

freesolv's Issues

Add scripts used to generate these files

@davidlmobley : We should capture the scripts you used to generate various parts of this repository.

Should SDF files contain other fields from database?

.sdf files can contain multiple key-value pairs, meaning we could store other fields from the database in these files.

Should we do this?
If so, which fields should we store? All of them, or a subset?
If a subset, which fields?
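To make the question concrete, here is a minimal sketch of how extra fields could be attached to an existing record. It assumes the usual SD layout (each data item is a `> <TAG>` header, a value line, and a blank line, with `$$$$` closing the record). The function name `append_sd_tags` and the tag names are hypothetical, not something the repository defines.

```python
def append_sd_tags(sdf_record: str, fields: dict) -> str:
    """Return a copy of a single-molecule SD record with extra data items."""
    # Drop the record terminator so the data items can be placed before it.
    body = sdf_record.split("$$$$")[0].rstrip("\n")
    items = []
    for tag, value in fields.items():
        # Each data item: header line, value line, then a blank line.
        items.append(f"> <{tag}>\n{value}\n")
    return body + "\n" + "\n".join(items) + "\n$$$$\n"


# Placeholder record with an empty molblock, for illustration only.
record = "example\n\n\n  0  0  0  0  0  0  0  0  0  0999 V2000\nM  END\n$$$$\n"
print(append_sd_tags(record, {"notes": "illustrative value"}))
```

Storing only a subset would then amount to choosing which keys go into `fields`.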

Protocol for generating solvated input files for various codes (AMBER, gromacs)

This issue is for discussing how we should generate solvated input files for various codes.

Questions:

Which codes should we support?
What is our goal? To make sure the systems are as identical as possible?

gromacs-mdp: with Verlet lists, rcoulomb != rvdw is not supported

rvdw = 1.2, right?
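A quick way to catch this before running grompp is to compare the two cutoffs in the .mdp file. The sketch below is only an illustration: the helper names `parse_mdp` and `cutoffs_match` are hypothetical, and it assumes a plain `key = value` file with `;` comments.

```python
def parse_mdp(text: str) -> dict:
    """Parse 'key = value' lines from .mdp text; ';' starts a comment."""
    params = {}
    for line in text.splitlines():
        line = line.split(";", 1)[0].strip()
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Normalize '-' to '_' so 'cutoff-scheme' and 'cutoff_scheme' compare equal.
        params[key.strip().lower().replace("-", "_")] = value.strip()
    return params


def cutoffs_match(params: dict) -> bool:
    """Return False only if the Verlet scheme is requested with rcoulomb != rvdw."""
    if params.get("cutoff_scheme", "").lower() != "verlet":
        return True
    if "rcoulomb" not in params or "rvdw" not in params:
        return True  # nothing to compare; defaults are left to GROMACS
    return float(params["rcoulomb"]) == float(params["rvdw"])


example = "cutoff-scheme = Verlet\nrcoulomb = 1.0\nrvdw = 1.2\n"
print(cutoffs_match(parse_mdp(example)))
```

Setting both cutoffs to the same value (1.2, if that is the intended rvdw) makes the check pass.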

GAFF version

I've searched around, but I can't find which version of GAFF was used. The difference between GAFF v1.7 and v1.8 could be noticeable (or perhaps an even older version was used?). If the exact GAFF version cannot be determined, the Amber release would give me a pretty good idea.

mobley_352111 and mobley_4689084 are the same molecule

These two molecules have the same IUPAC name and SMILES string, identical experimental values, and nearly identical calculated values. I think these are the only two entries in the set where the molecules are the same by the above criteria.

A few things differ, such as the uncertainty estimate and the citation. Should we remove one of the entries?
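A check like the following could confirm whether any other entries share a SMILES string. It is a sketch under assumptions: the database text is taken to be semicolon-delimited with the compound ID in the first column and the SMILES in the second, and lines starting with `#` are treated as comments. The function name is hypothetical. It only compares strings, so two different spellings of the same molecule would not be caught.

```python
from collections import defaultdict


def find_duplicate_smiles(lines):
    """Group compound IDs by SMILES string; keep groups with more than one ID."""
    by_smiles = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and assumed '#' comment lines
        fields = [field.strip() for field in line.split(";")]
        compound_id, smiles = fields[0], fields[1]
        by_smiles[smiles].append(compound_id)
    return {smi: ids for smi, ids in by_smiles.items() if len(ids) > 1}


sample = [
    "mobley_352111; CC(=O)OCCOC(=O)C",
    "mobley_4689084; CC(=O)OCCOC(=O)C",
]
print(find_duplicate_smiles(sample))
```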

Update column names to be more informative

Now that we're distributing FreeSolv in multiple forms (SDF files, Orion datasets), it would be useful if the column names (that become SD tags) were more informative and self-documenting about things like units. For example, the "expt" label is problematic since (1) it doesn't describe which of the many properties this is the experimental value for, and (2) it doesn't give the units. More informative names would very much help!
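As one possible shape for this, a simple rename table applied at export time would keep the stored keys unchanged while making the SD tags self-documenting. Only "expt" is named above; the other keys, all of the new names, and the assumption that values are in kcal/mol are illustrative.

```python
# Hypothetical mapping from terse column names to self-documenting SD tags.
# Only "expt" comes from the discussion; the rest is illustrative.
RENAMES = {
    "expt": "experimental_hydration_free_energy_kcal_per_mol",
    "d_expt": "experimental_uncertainty_kcal_per_mol",
    "calc": "calculated_hydration_free_energy_kcal_per_mol",
}


def rename_columns(record: dict, renames: dict = RENAMES) -> dict:
    """Return a copy of record with known keys renamed; other keys pass through."""
    return {renames.get(key, key): value for key, value in record.items()}


# Placeholder values, not real data.
print(rename_columns({"expt": -1.0, "notes": "placeholder"}))
```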

Flag molecules with possibly problematic tautomers (and investigate tautomers for them)?

OpenEye suggested a solution to flag molecules which potentially have multiple relevant tautomers. We ought to do this, or something similar, and at least flag such molecules, or possibly investigate them further via alternate tools such as Epik.

https://www.evernote.com/l/ABpN_7k6J4hPG4Ye4o06M7RKtc_ad4YKHJE

rebuild_freesolv.py script

Hello!

I have been trying to run the rebuild_freesolv.py script, but the molecules do not get correct coordinates: they are generated with the coordinates of all atoms set to zero (unlike the files already available in the mol2files_sybyl folder). Do you know what the reason might be?

Best wishes,
Kadi
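For anyone trying to reproduce this, a small check like the one below can flag affected files in bulk. It is a sketch that assumes the usual Tripos layout, where each line of the `@<TRIPOS>ATOM` block starts with atom ID, atom name, x, y, z, and atom type. The helper names are hypothetical.

```python
def mol2_coordinates(text: str):
    """Return a list of (x, y, z) tuples from the @<TRIPOS>ATOM block."""
    coords = []
    in_atoms = False
    for line in text.splitlines():
        if line.startswith("@<TRIPOS>"):
            in_atoms = line.strip() == "@<TRIPOS>ATOM"
            continue
        fields = line.split()
        if in_atoms and len(fields) >= 6:
            # Fields: atom_id, atom_name, x, y, z, atom_type, ...
            coords.append(tuple(float(value) for value in fields[2:5]))
    return coords


def has_all_zero_coordinates(text: str) -> bool:
    """True if the file has atoms and every atom sits at the origin."""
    coords = mol2_coordinates(text)
    return bool(coords) and all(c == (0.0, 0.0, 0.0) for c in coords)
```

Running `has_all_zero_coordinates(open(path).read())` over the regenerated files and over mol2files_sybyl would show whether the problem affects every molecule or only some.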

Strip water molecules from all topology/coordinate files in current database

Because of prior manual curation of files, not all topology and coordinate files contain water molecules. In addition, I just found out (from Sereina Riniker; e-mail excerpt below) that some of them contain TIP4P-Ew water molecules rather than TIP3P. Again, this results from manually gathering the topology/coordinate files (in some cases by students). The best long-term solution is to re-generate all topology/coordinate files from original source data (Issue #20), but an interim solution is just to strip all water molecules from existing topology/coordinate files.

Riniker's e-mail said this, in part:
"Regarding the [input files] I noticed two things which I thought you might like to know if you do not already. In the most recent version v0.31, I encountered 78 molecules where the GROMACS coordinate file .gro does not contain the solvent coordinates. In addition, there are 23 molecules where the solvent model in the coordinate file is not TIP3P (it contains 4 coordinates per solvent molecule). I attach the list of molecule numbers in case you would like to have a look at them."

The compound ID numbers for setups with TIP4P are:
1323538
1728386
186894
1873346
1875719
1923244
2005792
2049967
20524
2068538
2178600
2972906
3053621
3727287
3738859
4035953
511661
5157661
525934
5449201
8427539
9055303
9979854

And those for setups with no water are:
1034539
1160109
1469079
172879
1893815
1905088
1944394
2126135
2316618
242480
2484519
2492140
2613240
2636578
2659552
2844990
2845466
2850833
2960202
2972345
3040612
3083321
3211679
3265457
3269819
3359593
3515580
3686115
3802803
3976574
4149784
4371692
4479135
4587267
4603202
4613090
4678740
4689084
486214
4936555
5003962
5006685
5282042
5371840
5456566
5510474
5538249
5561855
5616693
5917842
6102880
6190089
6195751
6198745
628951
6359156
667278
6688723
6935906
7239499
7417968
7676709
7913234
8052240
819018
8208692
8311303
8337722
8823527
8827942
8883511
9257453
9510785
9653690
9717937
9741965
9821936
9897248
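The interim fix for coordinate files could look roughly like the sketch below, which works on fixed-width .gro text (title line, atom count, atom lines with residue number in columns 1-5 and residue name in columns 6-10, then the box line). The set of water residue names and the function names are assumptions. Atom indices are not renumbered, and the matching topology would still need its water entries removed separately.

```python
from itertools import groupby

WATER_RESNAMES = {"SOL", "HOH", "WAT"}  # assumed solvent residue names


def split_gro(text: str):
    """Split .gro text into (title, atom_lines, box_line)."""
    lines = text.rstrip("\n").split("\n")
    return lines[0], lines[2:-1], lines[-1]


def water_site_counts(text: str) -> set:
    """Set of atoms-per-molecule values found for water residues (3 or 4 sites)."""
    _, atoms, _ = split_gro(text)
    # Consecutive lines with the same residue number and name form one molecule.
    runs = groupby(atoms, key=lambda line: (line[0:5], line[5:10].strip()))
    return {len(list(group)) for (_, name), group in runs if name in WATER_RESNAMES}


def strip_water(text: str) -> str:
    """Return .gro text with water residues removed and the atom count updated."""
    title, atoms, box = split_gro(text)
    kept = [line for line in atoms if line[5:10].strip() not in WATER_RESNAMES]
    return "\n".join([title, f"{len(kept):5d}", *kept, box]) + "\n"
```

`water_site_counts` also gives a way to re-derive the two lists above: an empty set means no solvent, and a set containing 4 means a four-site water model.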

Re-generate trajectories for all compounds - implicit solvent, explicit solvent, and vacuum

I've had some requests for trajectory files associated with the calculated hydration free energies - especially, for endpoint trajectories. I would like to see all of these re-generated via a consistent, reproducible protocol from the latest version of input files.

This could be part of an effort to re-calculate all of the hydration free energies with a consistent protocol (which has been discussed elsewhere) or simply a re-generation of endpoint trajectories separately from that.

Relatedly, there is also some interest in having calculated hydration free energies in implicit solvent, along with associated endpoint trajectories.

Thoughts?

Cannot generate GAFF mol2 from Tripos mol2 file

Processing the provided Tripos mol2 file for this molecule with antechamber fails:

@<TRIPOS>MOLECULE
1,2,3,4,5-pentachloro-6-nitro-benzene
   14    14     1     0     0
SMALL
No Charge or Current Charge


@<TRIPOS>ATOM
      1 C1          1.8850   -1.0360   -0.1120 ca        1 MOL     -0.153400
      2 C2          2.9210   -1.6310    0.6080 ca        1 MOL      0.071500
      3 C3          2.8780   -1.6520    2.0020 ca        1 MOL      0.005700
      4 C4          1.8000   -1.0790    2.6760 ca        1 MOL      0.054600
      5 C5          0.7640   -0.4840    1.9550 ca        1 MOL      0.005700
      6 C6          0.8070   -0.4630    0.5610 ca        1 MOL      0.071500
      7 Cl1        -0.4790    0.2740   -0.3570 cl        1 MOL      0.005500
      8 Cl2        -0.5850    0.2340    2.7930 cl        1 MOL     -0.012400
      9 Cl3         1.7470   -1.1050    4.4180 cl        1 MOL     -0.009400
     10 Cl4         4.1750   -2.3960    2.8980 cl        1 MOL     -0.012400
     11 Cl5         4.2590   -2.3420   -0.2520 cl        1 MOL      0.005500
     12 N1          1.9280   -1.0150   -1.5410 no        1 MOL      0.316800
     13 O1          0.9850   -0.4760   -2.1600 o         1 MOL     -0.174500
     14 O2          2.9070   -1.5360   -2.1180 o         1 MOL     -0.174500
@<TRIPOS>BOND
     1    1    6 ar  
     2    1    2 ar  
     3    2    3 ar  
     4    3    4 ar  
     5    4    5 ar  
     6    5    6 ar  
     7    6    7 1   
     8    5    8 1   
     9    4    9 1   
    10    3   10 1   
    11    2   11 1   
    12    1   12 1   
    13   12   13 1   
    14   12   14 1   
@<TRIPOS>SUBSTRUCTURE
     1 MOL         1 TEMP              0 ****  ****    0 ROOT

To reproduce (using AmberTools 18.0):

$ antechamber -i in.mol2 -fi mol2 -o out.mol2 -fo mol2 -s 2 -at gaff2 -c bcc 

Welcome to antechamber 17.3: molecular input file processor.

acdoctor mode is on: check and diagnosis problems in the input file.
-- Check Format for mol2 File --
   Status: pass
Info: Finished reading file (in.mol2).
-- Check Unusual Elements --
   Status: pass
-- Check Open Valences --
   Status: pass
-- Check Geometry --
      for those bonded   
      for those not bonded   
   Status: pass
-- Check Weird Bonds --
/Users/choderaj/miniconda/bin/to_be_dispatched/antechamber: Fatal Error!
Weird atomic valence (5) for atom (ID: 12, Name: N1).
       Please check atomic connectivity.

Even running without requesting charges fails:

$ antechamber -i in.mol2 -fi mol2 -o out.mol2 -fo mol2

Welcome to antechamber 17.3: molecular input file processor.

acdoctor mode is on: check and diagnosis problems in the input file.
-- Check Format for mol2 File --
   Status: pass
-- Check Unusual Elements --
   Status: pass
-- Check Open Valences --
   Status: pass
-- Check Geometry --
      for those bonded   
      for those not bonded   
   Status: pass
-- Check Weird Bonds --
/Users/choderaj/miniconda/bin/to_be_dispatched/antechamber: Fatal Error!
Weird atomic valence (5) for atom (ID: 12, Name: N1).
       Please check atomic connectivity.

I could not locate the script in this repo used to generate the GAFF mol2 files, so I could not check (1) which AmberTools version, and (2) which options were used.

Decide any other supporting files/data which ought to be captured when database is re-constructed from primary data

There are other supporting files/information aside from those we currently provide which have been requested:

Include frcmod files created in parameterization
Include ZINC compound IDs via https://github.com/rasbt/smilite
Better flagging of cases which might have problematic tautomers, perhaps using a mechanism like that suggested by OpenEye here: https://www.evernote.com/l/ABpN_7k6J4hPG4Ye4o06M7RKtc_ad4YKHJE

Decide what additional data we want pulled from the source literature/what instructions to give

I have many undergrads who want to do research. They could be enlisted to do additional curation on experimental data in this set, if we can come up with clear guidance on what we want and how to curate it. Some things we may want to have them find:

Temperature at which the measurements were conducted

Set up Travis-CI testing

I want to enable Travis-CI testing. At the very least it should probably:

Check that the info in the database is complete
Check for duplicate molecules
Check that reading in the mol2/sdf files into OEMols results in the same isomeric SMILES as those corresponding to the "source data"
Re-make the plots and re-compute the statistics (and check that statistics haven't changed?)

More ambitiously, it could also:

Rebuild input files (all of them is probably too slow, though this could be flagged as a "slow" test and skipped by Travis-CI; perhaps a test could rebuild some of the input files)
...?
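The first item on the list could start as small as the sketch below: a function that reports entries with missing fields, plus a pytest-style test that a CI service would run. The field names in `REQUIRED_FIELDS` and the in-memory sample are hypothetical stand-ins; a real test would load the actual database instead.

```python
REQUIRED_FIELDS = ("smiles", "expt", "d_expt", "expt_reference")  # hypothetical keys


def find_incomplete_entries(database: dict, required=REQUIRED_FIELDS) -> dict:
    """Map compound ID to the list of required fields that are missing or empty."""
    problems = {}
    for compound_id, entry in database.items():
        missing = [key for key in required if entry.get(key) in (None, "")]
        if missing:
            problems[compound_id] = missing
    return problems


def test_database_is_complete():
    # Placeholder entry standing in for the real database contents.
    database = {
        "mobley_0000000": {
            "smiles": "C",
            "expt": 0.0,
            "d_expt": 0.0,
            "expt_reference": "placeholder",
        }
    }
    assert find_incomplete_entries(database) == {}
```

Returning the full list of problems, rather than failing on the first one, keeps the CI log useful when several entries need attention.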

Problems with processing some SMILES - omega returned error code 0

Hello!

When using the FreeSolv workflow to convert SMILES to .mol2 files, I encounter the error
RuntimeError: omega returned error code 0
for the SMILES copied below.
I would be grateful for suggestions on how to overcome this. I am also attaching a copy of my conda environment.

Best wishes,
Kadi

OCc1cc(ccc1O)C(O)CNC(C)(C)C
CC(C)NCC(COc1ccc(cc1)CC(=O)N)O
O=S(c1ccc(cc1)\C=C3/c2ccc(F)cc2\C(=C3C)CC(=O)O)C
Fc2cc(ccc2c1ccccc1)C(C(=O)O)C
O=S(=O)(N=C(N)CCSCc1nc(/N=C(/N)N)sc1)N
CC(C)Cc1ccc(cc1)C(C)C(=O)O
CC(C)NCC(COc1cccc2c1cccc2)O
O=C(O)C(c1ccc(cc1)N3C(=O)c2ccccc2C3)C
CC(C)NCC(O)COc1cccc2[nH]ccc12
OC(CNC(C)(C)C)COc1cccc2c1CC@H C@HC2
c1ccc2c(c1)C(=O)N(C2=O)C3CCC(=O)NC3=O
CN3[C@H]1CC[C@@h]3CC@@HOC(=O)C(CO)c2ccccc2
O=S(=O)(c1ccc(cc1)C)NC(=O)NN3CC2CCCC2C3
OC(c1ccccc1)(CCN2CCCCC2)C3CCCCC3

environment.yml.txt

Make CHARMM input files via ParmEd

We already have AMBER and GROMACS files here; we should also provide CHARMM.

Is gromacs_energies available to download anywhere?

Hi,

I was wondering whether gromacs_energies is available to download anywhere.

Thank you,
Xiaowei

Add source for v0.31_docs.pdf

@davidlmobley : Can you send along the source for whatever generated v0.31_docs.pdf?

Create Release for v0.31

We should create an official release for v0.31.

Naming convention for 'title' record in Tripos mol2 files

The 'title' record in the Tripos mol2 files in mol2files_sybyl/ (which I think should be renamed to tripos_mol2/) is a mystery to me. Are these IUPAC names, processed in some way? What is the algorithm for generating these?

In next update, include a json format version of database

Pickle format is Python-specific, so we should provide a JSON version of the database with the next update.
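If the pickled database is a dictionary of plain Python types (an assumption; anything else would need a custom encoder), the conversion is only a few lines. The function name and file paths here are placeholders.

```python
import json
import pickle


def pickle_to_json(pickle_path: str, json_path: str) -> None:
    """Load a pickled database and write it back out as JSON."""
    with open(pickle_path, "rb") as handle:
        database = pickle.load(handle)
    with open(json_path, "w") as handle:
        # Sorted keys and indentation keep diffs between releases readable.
        json.dump(database, handle, indent=2, sort_keys=True)
```

A call such as `pickle_to_json("database.pickle", "database.json")` would then be run once per release.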

Potential duplicate molecules in FreeSolv Set

While typing FreeSolv molecules with smirnoff99Frosst, I found 4 molecules that are potentially duplicated in the FreeSolv set. Below is the code snippet I used that found the duplicates:

import glob
import os
from openforcefield.utils import read_molecules
from openeye import oechem

# untarred mol2files_sybyl.tar.gz
DBpath = "/FreeSolv/mol2files_sybyl/*.mol2"

# First molecule and file name seen for each isomeric SMILES
isosmiles_to_mol = {}
smi_to_file = {}

for file in glob.glob(DBpath):
    mol = read_molecules(file, verbose=False)[0]
    f = os.path.basename(file)
    c_mol = oechem.OEMol(mol)
    oechem.OEAddExplicitHydrogens(c_mol)
    smi = oechem.OECreateIsoSmiString(mol)
    # A repeated isomeric SMILES flags a potential duplicate
    if smi in isosmiles_to_mol:
        print("File:   %35s %35s" % (f, smi_to_file[smi]))
        print("Title:  %35s %35s" % (c_mol.GetTitle(), isosmiles_to_mol[smi].GetTitle()))
        print("SMILES: %35s %35s" % (smi, oechem.OECreateIsoSmiString(isosmiles_to_mol[smi])))
        print('\n')

    isosmiles_to_mol[smi] = c_mol
    smi_to_file[smi] = f

# OUTPUT: 

#File:                   mobley_4689084.mol2                  mobley_352111.mol2
#Title:               2-acetoxyethyl acetate              2-acetoxyethyl acetate
#SMILES:                    CC(=O)OCCOC(=O)C                    CC(=O)OCCOC(=O)C
#
#
#File:                   mobley_9897248.mol2                  mobley_819018.mol2
#Title:  (2Z)-3,7-dimethylocta-2,6-dien-1-ol (2E)-3,7-dimethylocta-2,6-dien-1-ol
#SMILES:                   CC(=CCCC(=CCO)C)C                   CC(=CCCC(=CCO)C)C
#
#
#File:                   mobley_9913368.mol2                 mobley_4465023.mol2
#Title:             (E)-1,2-dichloroethylene            (Z)-1,2-dichloroethylene
#SMILES:                           C(=CCl)Cl                           C(=CCl)Cl
#
#
#File:                   mobley_9979854.mol2                  mobley_628086.mol2
#Title:      (2R)-1,1,1-trifluoropropan-2-ol     (2S)-1,1,1-trifluoropropan-2-ol
#SMILES:                       CC(C(F)(F)F)O                       CC(C(F)(F)F)O

Implement Bayly's recommended AM1-BCC charge assignment scheme

I'm starting a PR (#11) to implement Christopher Bayly's recommended AM1-BCC charge assignment scheme. A script will be used to generate AM1-BCC charges from the primary data using the OpenEye tools.

I will clean up some other issues at the same time.

Have GBSA models been benchmarked on FreeSolv?

@davidlmobley: Have you benchmarked various GBSA models on the FreeSolv dataset yet? We're just wondering what the expected RMSEs are here.

Delete ANTECHAMBER.AC file in top directory?

I uploaded all the files from the v0.31 release as the first commit. However, it seems there were some stray antechamber files living in the top directory of the tarball. Can we delete those, @davidlmobley? If so, I'll file a pull request.

Re-compute hydration free energies for all compounds

Compute explicit solvent hydration free energies for all compounds
- Store both charging and non-polar (vdW) components separately - these are useful to implicit solvent developers
Compute implicit solvent hydration free energies for all compounds
- Again, polar and non-polar components should both be stored

If this is handled prior to trajectory regeneration (#15) it should be done at the same time.

Sanitize SDF files

The current SDF files contain about 40 molecules that are non-neutral. Here's a script that regenerates correct ones.

import csv
import os
from rdkit import Chem
from rdkit.Chem import AllChem

def is_neutral(mol):
    net_charge = 0
    for a in mol.GetAtoms():
        net_charge += a.GetFormalCharge()
    return net_charge == 0

mols = []

mmff_fail_count = 0

with open('database.txt', newline='') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=';', quotechar='|')
    for line, row in enumerate(spamreader):
        if line > 2:
            name = row[0]
            smiles = row[1]
            
            
            mol = Chem.MolFromSmiles(smiles)      
            mol = Chem.AddHs(mol)
            
            print(smiles)
            res = AllChem.EmbedMolecule(mol)
            assert res == 0 
            res = AllChem.MMFFOptimizeMolecule(mol)
            
            if res != 0:
                mmff_fail_count += 1

            exp_dG = float(row[3])
            exp_dG_err = float(row[4])
            

            mol.SetProp('_Name', name)
            mol.SetProp('dG', str(exp_dG))
            mol.SetProp('dG_err', str(exp_dG_err))
            
            assert is_neutral(mol)
            
            mols.append(mol)

print("mm_fail", mmff_fail_count)

w = Chem.SDWriter('freesolv.sdf')
for m in mols:
    w.write(m)
w.close()

print("wrote", len(mols), "mols")

Re-construct database files from primary data

This is implied by several other issues, but ought to exist as its own issue.

To complete this issue, we must first resolve Issues #14, #13, and #19.

This issue must be completed before we can resolve Issues #15 and #16.

Resolving this issue will also provide the best resolution of the issue with water molecules in topology files (# to be inserted here).

Migrate issues and close this repo?

@jchodera - do you have any objection to closing this repo, or at least updating the README.md to reflect that the official repo is elsewhere, if I migrate the issues over to github.com/mobleylab/freesolv?

Now that my group is up on GitHub, it would probably be best for me to continue to maintain this on our site rather than yours, especially since we have some updates coming soon.

Resolve issues relating to trajectory re-generation and re-calculation of hydration free energies

What format should the trajectories be made available in?
Is it OK if all of these are generated via OpenMM?
I presume we should use the same GAFF/AM1-BCC files, including the same exact charges; we presumably want to use the new consistent mol2 files I generate with a single script from PR #11, right?

mobley_3323117 (sulfolane) has non-standard SMILES

Molecule mobley_3323117 (sulfolane) is written with the non-standard SMILES C1CC[S+2](C1)([O-])[O-], rather than the more standard C1CCS(=O)(=O)C1.

Although the two forms have the same total charge, they differ in formal charges (+2 on S and -1 on each O, versus 0 on every atom in the standard SMILES), so the OpenFF toolkit (with the OpenEye backend) treats them as inequivalent molecular representations:

>>> from openff.toolkit.topology import Molecule
>>> freesolv_molecule = Molecule.from_smiles('C1CC[S+2](C1)([O-])[O-]')
>>> standard_molecule = Molecule.from_smiles('C1CCS(=O)(=O)C1')
>>> freesolv_molecule.generate_unique_atom_names()
>>> standard_molecule.generate_unique_atom_names()
>>> [(atom.name, atom.formal_charge.m) for atom in freesolv_molecule.atoms]
[('C1x', 0), ('C2x', 0), ('C3x', 0), ('S1x', 2), ('C4x', 0), ('O1x', -1), ('O2x', -1), ('H1x', 0), ('H2x', 0), ('H3x', 0), ('H4x', 0), ('H5x', 0), ('H6x', 0), ('H7x', 0), ('H8x', 0)]
>>> [(atom.name, atom.formal_charge.m) for atom in standard_molecule.atoms]
[('C1x', 0), ('C2x', 0), ('C3x', 0), ('S1x', 0), ('O1x', 0), ('O2x', 0), ('C4x', 0), ('H1x', 0), ('H2x', 0), ('H3x', 0), ('H4x', 0), ('H5x', 0), ('H6x', 0), ('H7x', 0), ('H8x', 0)]

Would it be reasonable to correct the non-standard SMILES string and re-generate the database?
Or are there ways to automatically standardize the formal charges?

Which database format should store primary data?

Currently, there is a Python pickle file that stores both primary data and derived data. This is very convenient for Python, but less convenient for anything that is not Python.

I wonder if we want to keep just the primary data in a nice, portable, small file from which everything (including convenient Python pickles) is derived. But what format should this be?

Python pickle (still not super convenient)
JSON?
XML?
SQLite?

As a reminder, we decided the primary data consisted of the following:

canonical isomeric SMILES
experimental data:
- experimental value
- experimental uncertainty
- citation for experimental data
notes field

Eventually, it would be great if there was also more provenance data for the experimental value (e.g. if Peter Guthrie had computed it from combining data from multiple publications and applying a conversion) but this is a more advanced topic.
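To compare the options, here is what the SQLite variant might look like for the primary data listed above. This is a sketch, not a proposal for the final schema: the table name, column names, and helper functions are all hypothetical.

```python
import sqlite3

# One row per compound, holding only the primary data fields.
SCHEMA = """
CREATE TABLE IF NOT EXISTS primary_data (
    compound_id      TEXT PRIMARY KEY,
    smiles           TEXT NOT NULL,  -- canonical isomeric SMILES
    expt_value       REAL NOT NULL,  -- experimental value
    expt_uncertainty REAL NOT NULL,  -- experimental uncertainty
    expt_reference   TEXT NOT NULL,  -- citation for experimental data
    notes            TEXT
)
"""


def create_primary_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the database file and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def add_entry(conn, compound_id, smiles, value, uncertainty, reference, notes=""):
    """Insert one compound; a repeated compound_id raises sqlite3.IntegrityError."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO primary_data VALUES (?, ?, ?, ?, ?, ?)",
            (compound_id, smiles, value, uncertainty, reference, notes),
        )
```

The primary-key constraint gives duplicate-ID protection for free, which a JSON or XML file would need a separate check for. The trade-off is that a binary file is harder to diff and review than text.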
